
SEARCHING FOR INFORMATION ON OCCUPATIONAL ACCIDENTS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

the Degree Doctor of Philosophy in the Graduate

School of The Ohio State University

By

Shih-Kwang Chen, M.S.

*****

The Ohio State University 2008

Dissertation Committee:

Professor Philip J. Smith, Adviser

Professor Jerald R. Brevick

Professor Steven A. Lavender

Approved by

Adviser
Industrial and Systems Engineering Graduate Program

Copyright by Shih-Kwang Chen 2008

ABSTRACT

Effective retrieval of the most relevant documents on the topic of interest from the

Internet is difficult due to the large amount of information in all types of formats. Studies have been conducted on ways to improve information retrieval (IR). One approach to improve searches in large collections, such as the Web, is to take advantage of semantic representations in pre-existing relational databases that have been developed for explicit purposes besides supporting Internet searches in general. In an effort to enhance IR on the Internet, a prototype of a topic-oriented search agent, SAOA-1, was developed to use embedded semantics and domain-specific knowledge extracted from such a database.

Activated when a set of retrieved keywords appears related to the topic of

“occupational accidents”, SAOA-1 constructs an alternative search query and pruned lists of suggested refinements by applying search engine knowledge and the domain-specific knowledge and semantics extracted from a relational database. Information seekers could then use the alternative search query, or refine it further with a modified search query developed by SAOA-1 based on its semantic representation of the topic of occupational accidents, to complete context-sensitive pruning of the semantic neighborhood.

An empirical study was conducted to evaluate the usefulness of SAOA-1 in assisting information seekers to retrieve relevant documents. Sixty participants were randomly assigned to one of two treatments, with or without the assistance of SAOA-1, to perform Internet searches. Prior to performing searches, each participant had to decide upon a topic based on two given articles addressing accidents in the workplace. The participants then performed searches and, when satisfied with the results, evaluated the relevance of the first forty documents of their final search to their research topics on a 1-to-5 Likert scale. It was hypothesized that the treatment type would have an overall effect on the expected rating. Based on the data collected, the treatment effect on the average rating was statistically significant (p < 0.001), with an improvement of 10% due to the treatment.

From a practical perspective, however, the size of the effect was modest. The findings suggest that a topic-oriented search agent might be useful in assisting information seekers to retrieve more relevant documents, but they also point to important directions for further evaluating methods and settings for taking advantage of such semantically-based search techniques.


Dedicated to my family


ACKNOWLEDGMENTS

This dissertation could not have possibly been written without the dedicated support from my advisor, Dr. Phil Smith. My many thanks and appreciation to him for

“adopting” me as his graduate student. I am deeply indebted to him for his guidance, challenge, encouragement, patience, and his help in many ways. Thank you, Dr. Smith!

I am grateful to Drs. Jerald Brevick and Steven Lavender for serving on my committee and their friendship and contribution. I wish to thank Drs. Mark McCord

(Civil Engineering), Jane Fraser, and Gary Maul who had advised me during prior stages of my study. I wish to thank the faculty and staff of the Industrial, Welding, and Systems

Engineering department for their friendship during my years in the department, especially

Drs. Clark Mount-Campbell, Jose Castro, James Connors (Agriculture Education), David

Dickinson, Ralph Gardner III (Special Education), Julie Higle, Robert Lundquist, and

Allen Miller who had been involved in my program. I wish to thank Ms. Pam Hussen for always making me feel welcomed even during her busiest hours. I wish to thank Dr.

William Harper for his encouragement and for spending his precious consulting time with my questions on statistics. I would like to thank Mr. Tim Watson at Graduate School for his generous assistance.

I want to express my gratitude to Mr. Cedric Sze for his moral and financial support and for sharing his experiences and knowledge all these years. He has been my boss, my mentor, and my good friend. He is always positive and looks at things from the brighter side. I wish to thank my colleagues in the department’s computer lab office for their friendship and expertise in computers. I have significantly enhanced and broadened my computer skills by having worked with them.

I wish to thank Dr. Mohammed Rahman at the Office of Information Technology for his statistics consulting services. I wish to thank Miss Laurie Brevick for serving as my editor. I want to thank my neighbors Kelly Jeppensen (Doctor-to-be) and Jeff Dotson (Ph.D.-to-be) for our frequent discussions on statistics. The final data analysis was produced with Stata, which was installed on Kelly’s laptop.

I feel deeply blessed to have had many other friendships with special thanks to

Scott and Angie Kelly for being such wonderful neighbors. I’d like to thank Peir-En Yeh and her mother for treating my family as part of their family.

I want to thank my buddy John Hoffman, Jr. He is my “twin” brother, my dictionary of English, and my map of the U.S.A. Thanks, Jack, for always being there for me!

Finally, I want to thank my wife Amy and sons Jerem and Jed, my mother-in-law

Chiou-Tzu Lin Wang, my mother Nancy and father Sankuo, my sister Alice and her family: Oliver, Dennis and Debbie, and my other sisters Joyce and Doris for their love and support. Special thanks to my Mom for helping Amy and me throughout these past two years.


VITA

2000 ...... M.S. Industrial and Systems Engineering, The Ohio State University

FIELDS OF STUDY

Major Field: Industrial and Systems Engineering


TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction

2. Literature Review
   2.1 Introduction
   2.2 General IR techniques
      2.2.1 Relational models
      2.2.2 Character string models
         2.2.2.1 Boolean models
         2.2.2.2 Stemming
      2.2.3 Statistical models
         2.2.3.1 Vector space models
         2.2.3.2 Latent semantic indexing models
         2.2.3.3 Probabilistic models
      2.2.4 Models based on browsing
      2.2.5 Semantic representations
         2.2.5.1 Thesauri
         2.2.5.2 Semantic Web
   2.3 Post-retrieval processing
      2.3.1 Ranking
      2.3.2 Refinements
   2.4 Search assistance provided by search engines
   2.5 Summary

3. Software design: conceptual development and implementation
   3.1 Introduction
   3.2 Test results with the assistance of SAOA-1
   3.3 The assistance of SAOA-1
      3.3.1 The OSHA database
      3.3.2 Search agent role in refining terms and increasing results relevancy
   3.4 Implementation methods
      3.4.1 Knowledge base preparation
      3.4.2 Software development
         3.4.2.1 Tools
         3.4.2.2 Software development
         3.4.2.3 The interface
   3.5 Functionalities of the agents
      3.5.1 Functionalities of SAOA-1
      3.5.2 Functionalities of SAOA-0

4. Empirical evaluation
   4.1 Introduction
   4.2 Methods
      4.2.1 Participants
      4.2.2 Questionnaire
      4.2.3 Procedure
      4.2.4 Analyses of data
   4.3 Results
      4.3.1 Demographic information
      4.3.2 Research findings
         4.3.2.1 Control group vs. treatment group
         4.3.2.2 Control group vs. Alternative Search group
         4.3.2.3 Control group vs. Modified Search group
         4.3.2.4 Control group vs. “Four Buttons” group
      4.3.3 Search behaviors
         4.3.3.1 Average length of initial queries
         4.3.3.2 Number of searches
         4.3.3.3 Search patterns
      4.3.4 Usability of the interface

5. Discussion and conclusions
   5.1 Introduction
   5.2 Summary of research findings
      5.2.1 Related studies
      5.2.2 Main effects
      5.2.3 Search behaviors
      5.2.4 Usability of the interface
   5.3 Significance of the study
   5.4 Limitations of the study
   5.5 Future research

List of References

APPENDICES:

APPENDIX A Investigation summary form (OSHA-170 Form)

APPENDIX B Investigation summary codes

APPENDIX C Questionnaires

APPENDIX D Answers to open-ended questions


LIST OF TABLES

2.1 Real-time search assistance by Google
2.2 Real-time search assistance by Live Search
2.3 Real-time search assistance by Yahoo
2.4 Real-time search assistance by Ask
3.1 A summary of the first ten search results for “hand” and “cut” from Google, Yahoo, and Live Search
3.2 A summary of the number of relevant documents in the first ten search results from Original Search, Alternative Search, and Modified Search, respectively
3.3 Controlled vocabulary for the semantic class “Part of Body” for OSHA’s Investigation Summaries Form
3.4 Sample of Injury Line: Fred Meyer Stores, Inc.
3.5 Sample of Injury Line: Sugarhouse Barbecue
3.6 Concepts associated with the semantic class “Source of Injury”
3.7 Sample of pruned lists of terms showing concept relationships across semantic classes using hand-related injuries
3.8 Injury Line: Definzio Imports, Inc.
3.9 Injury Line: Daisy Publishing Company Inc.
3.10 Sample of a pruned list for suggested refinements using hand-related injuries
4.1 Gender distribution of participants
4.2 Age distribution of participants
4.3 Number of college students among the participants
4.4 Highest level of education completed by the participants
4.5 Internet usage history among participants
4.6 Number of locations for Internet access
4.7 Internet access locations
4.8 Internet search frequency
4.9 Group and overall Internet search engine preferences
4.10 List of search topics and initial queries
4.11 Search patterns in the control group (sorted by the number of searches per participant)
4.12 Search patterns in the treatment group (sorted by the number of searches per participant)
4.13 Four different types of search patterns in the treatment group (sorted by the number of searches per participant)
4.14 The search outcomes related to repetitive stress injury from the first forty results of Participant 015’s sixth search
4.15 Descriptive statistics for Questions 1 to 4 in the treatment group
4.16 Descriptive statistics from Questions 1 and 2 in the control group


LIST OF FIGURES

2.1 A screen image of the real-time search assistance (“Related searches” and “Search related to”) from Google
2.2 A screen image of the real-time search assistance (“Related searches”) from Live Search
2.3 A screen image of the real-time search assistance (“Search Assist”) from Yahoo
2.4 A screen image of the real-time search assistance (“Also try”) from Yahoo
2.5 A screen image of the real-time search assistance (“Search Assist”) from Yahoo
2.6 A screen image of the real-time search assistance (“Search Suggestions”) from Ask
2.7 A screen image of the real-time search assistance (“Narrow Your Search” and “Expand Your Search”) from Ask
3.1 A screen image of the first ten search results for “hand” and “cut” from Google
3.2 A screen image of the first ten search results for “hand” and “cut” from Yahoo
3.3 A screen image of the first ten search results for “hand” and “cut” from Live Search
3.4 A screenshot of the Homepage of SAOA-1
3.5 A screen image of the first ten search results for “hand” and “cut” from the Original Search
3.6 A screen image of the partially zoomed-in version of Figure 3.5 for readability (Original Search)
3.7 A screen image of the first ten search results for “hand” and “cut” from Alternative Search
3.8 A screen image of the partially zoomed-in version of Figure 3.7 for readability (Alternative Search)
3.9 A screen image of the list of Suggested refinements for Alternative Search
3.10 A screen image of the partially zoomed-in version of Figure 3.9 for readability (Suggested refinements for Alternative Search)
3.11 A screenshot of the dialog box displaying the new query for Modified Search
3.12 A screen image of the first ten search results for “hand” and “cut” from Modified Search
3.13 A screen image of the partially zoomed-in version of Figure 3.12 for readability (Modified Search)
3.14 Google’s first ten search results for the keywords “occupational accidents”
3.15 Google’s first ten search results for the keywords “accidents occupational”
3.16 Google’s comments on the AND operator
3.17 A screenshot of the dialog box displaying the new query for Modified Search
3.18 A screen image of the first ten results from SAOA-0
3.19 A screen image of the partially zoomed-in version of Figure 3.18 for readability (SAOA-0)
4.1 Gender distribution of participants
4.2 Age distribution of participants
4.3 Number of college students in each group
4.4 Highest level of education completed by participants
4.5 Internet usage history among participants
4.6 Number of different locations to access the Internet
4.7 Internet access locations
4.8 Internet search frequency
4.9 Internet search engine preferences by group
4.10 Overall Internet search engine preferences
4.11 Histogram depicting the average rankings for both groups
4.12 Interaction plot for the average rating score of each of the forty documents by group
4.13 Histogram depicting the length of initial queries by group
4.14 A graphic summary of average length of initial queries
4.15 Histogram depicting number of searches by group
4.16 A screen image of Participant 022’s first search outcomes
4.17 A screen image of Participant 022’s second search outcomes
4.18 A screen image of Participant 022’s third search outcomes
4.19 A screen image of Participant 022’s fourth (final) search outcomes
4.20 A screen image of Participant 057’s first search outcomes
4.21 A screen image of Participant 057’s second (final) search outcomes
4.22 A screen image of Participant 012’s first search outcomes
4.23 A screen image of Participant 012’s second (final) search outcomes
4.24 A screen image of Participant 015’s first search outcomes
4.25 A screen image of Participant 015’s second search outcomes
4.26 A screen image of Participant 015’s third search outcomes
4.27 A screen image of Participant 015’s fourth search outcomes
4.28 A screen image of Participant 015’s fifth search outcomes
4.29 A screen image of Participant 015’s sixth search outcomes
4.30 A screen image of Participant 015’s seventh (final) search outcomes
4.31 Histograms depicting Likert scale ratings of Questions 1 to 4 in the treatment group
4.32 Histograms depicting Likert scale ratings of Questions 1 and 2 in the control group


CHAPTER 1

INTRODUCTION

With advanced technologies and the huge growth of the Internet, the Web has become a massive collection of all sorts of information in all types of formats. One approach to improve searches in large collections is to take advantage of semantic representations. In different domains, relational databases have been developed for explicit purposes in addition to fundamentally supporting information retrieval (IR).

However, these databases contain information about the semantics relevant to their domains.

The semantics (i.e., the meanings of words) and knowledge embedded in these databases may be extractable and useful for supporting Internet searches. Thus, the fundamental research question to be addressed in this study is:

Is it possible to take advantage of semantics within existing relational databases to improve Web searches and information displays? In other words, can the semantics for a given domain as represented in such existing relational databases be leveraged to enhance information retrieval and display for that domain?

In this study, we developed a prototype of an IR system – a topic-oriented search agent that uses the semantics and knowledge from an existing relational database (developed for purposes other than to support general Internet searches) in a given domain to improve Internet IR and display. The prototype, Search Agent on Occupational Accidents – 1 (SAOA-1, working title: Agent Smith), extracts semantics and domain-specific knowledge from the Occupational Safety & Health Administration’s (OSHA, http://www.osha.gov/) Accident Investigation Search database (http://www.osha.gov/pls/imis/accidentsearch.html) and seeks to enhance search over the Internet on the topic of “Occupational Accidents.” If the information seeker’s submitted search query is identified as related to the topic “Occupational Accidents,” SAOA-1 will assist the information seeker in refining the search query by suggesting refinements.

The organization of this dissertation is as follows. Chapter 2 presents a review of literature on general IR techniques, post-retrieval processing, evaluation of IR systems, and search assistance provided by search engines. Chapter 3 discusses the conceptual development, implementation, functionalities, capabilities, and interface design of

SAOA-1. Chapter 4 reports the methods and results from an empirical study evaluating

SAOA-1, including the procedures and the results summarizing research findings.

Chapter 5 concludes the dissertation with a general discussion that contains a summary of the primary findings, limitations of this study, and possible directions for extending this research.


CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Conventionally, IR has been a label used to refer to “document retrieval, and specifically to automatic document retrieval” (Sparck Jones, 1992, p. 159). IR “deals with the representation, storage, and access to documents or representatives of documents

(document surrogates)” (Minker, 1977, as cited by Salton & McGill, 1983, p. 7). In brief,

IR can be considered as the activities of retrieving useful knowledge (information) from a collection of information items through an IR system. Automated IR systems “were originally developed to help manage the huge scientific literature that has developed since the 1940s” (Frakes, 1992a, p.1). A straightforward definition of an IR system was given by Lancaster as:

“An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.” (Lancaster, 1968, as cited in van Rijsbergen, 1979, p.1)

IR activities used to rely on human intermediaries such as reference librarians, paralegals, and similar professional searchers. However, with the development of new

technologies and the use of the Internet, hundreds of millions of people engage in IR on the Internet, i.e., Internet searches, on a daily basis (Manning, Raghavan, & Schütze,

2008).

Broder and Henzinger (2002) suggest that Internet search is “a variant of classical information retrieval” (p. 5). Unlike classic IR, the documents on the Web are heterogeneous in quality and format, and lack stability. Furthermore, the size of the

“database” is enormous (Broder & Henzinger, 2002; Chowdhury, 2004). Commercial search engines are widely used to perform IR on the Internet. Google

(http://www.google.com/), Yahoo (http://search.yahoo.com/), Live Search

(http://www.live.com/), and Ask.com (http://www.ask.com/) are current well-known commercial search engines.

In general, IR involves a sequence of activities including query submission, document retrieval with IR techniques, and post-retrieval processing such as ranking and query refinement. One of the major challenges in using such search engines is that there is often a semantic gap between the information desired by the information seekers and the interpretation of their queries used by the IR systems to retrieve information (Smith,

1995; Maron & Kuhns, 1960; Ranganathan & Liu, 2006; Robins, 2000). In short, it is often difficult for an information seeker to translate his or her semantic topic of interest into a query that effectively retrieves the relevant information.

In this chapter, we review some approaches to these IR activities, as they are used by the search engines that we seek to enhance. The rest of this chapter is organized as follows. Section 2.2 reviews some fundamental IR techniques. Section 2.3 discusses

approaches for post-retrieval processing. Section 2.4 discusses the search assistance provided by search engines. Finally, Section 2.5 provides a summary of this chapter.

2.2 General IR Techniques

Many conceptual IR models and techniques have been presented in the field of IR.

These models include relational models (Salton & McGill, 1983), Boolean related models such as the extended Boolean model (Salton, Fox, & Wu, 1983), the vector space model

(Salton & McGill, 1983), cluster-based retrieval models (van Rijsbergen, 1979; Salton &

McGill, 1983), the latent semantic indexing model (Deerwester, Dumais, Furnas,

Landauer, & Harshman, 1990), probabilistic models (van Rijsbergen, 1979; Chowdhury,

2004; Crestani, Lalmas, van Rijsbergen, & Campbell, 1998), and models based on browsing (Wolfram, 2003). Techniques like stemming have also been developed to deal with the complexities of human languages. Summaries or detailed discussions of different models can be found in Baeza-Yates and Ribeiro-Neto (1999), Chowdhury (2004), Croft and Turtle (1992), Frakes (1992a), Grossman and Fireder (2004), and Wolfram (2003).

None of these techniques use true semantic representations. However, they all offer techniques that are potentially complementary to a search engine that makes use of semantic representations to support IR. These complementary non-semantic approaches are first discussed below, as they are used by search engines like Google as part of the retrieval process and will therefore influence the results of SAOA-1. This is followed by a discussion of the literature on semantically-based search methods, as the core of SAOA-1 takes advantage of a semantic representation.

2.2.1 Relational Models

This section provides some examples of existing systems that use databases with documents that are indexed using controlled vocabularies that have been organized using a relational model. (These databases may or may not actually use relational database technology to process queries, however.)

A relational database is a physical implementation (i.e., databases storing and manipulating information) of a relational model describing some aspects of the real world, based on a set of rules as first proposed by Codd (Codd, 1970; Riordan, 2005). In a relational database, the relationships between entities are derived prior to responding to queries (Salton & McGill, 1983). In relational models, a relation is simply a table in which all data (i.e., entities) are conceptually segmented into records (i.e., tuples or rows) in an orderly arrangement of fields (i.e., attributes or columns) (Riordan, 2005; Salton & McGill, 1983). Because “each relation is characterized by its relation scheme, which is simply the ordered list of attribute names used to identify the records” (Salton & McGill, 1983, p. 366), the entities represent concepts or knowledge of the domain: each column in the database represents a concept class that defines class membership, and the semantic relationships among entities are available for extraction and use in IR.
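To make these relational-model notions concrete, the following minimal sketch (in Python, using the standard sqlite3 module; the table, column names, and rows are hypothetical, loosely styled after the OSHA accident summaries discussed later in this chapter) shows how each column of a relation can be treated as a concept class whose controlled-vocabulary values, and cross-class relationships, can be extracted:

    import sqlite3

    # Minimal sketch: each row (tuple) is an entity; each column (attribute)
    # acts as a concept class whose distinct values form a controlled
    # vocabulary. All names and rows here are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE accident (
            id           INTEGER PRIMARY KEY,
            part_of_body TEXT,  -- concept class "Part of Body"
            event_type   TEXT,  -- concept class "Event Type"
            source       TEXT   -- concept class "Source of Injury"
        )""")
    conn.executemany(
        "INSERT INTO accident (part_of_body, event_type, source) VALUES (?, ?, ?)",
        [("HAND", "CUT/LACERATION", "KNIFE"),
         ("HAND", "CUT/LACERATION", "SAW"),
         ("FINGER", "AMPUTATION", "PRESS")])

    # The controlled vocabulary of one semantic class:
    print([r[0] for r in conn.execute("SELECT DISTINCT part_of_body FROM accident")])
    # ['HAND', 'FINGER']

    # A cross-class semantic relationship: sources of injury co-listed with HAND.
    print([r[0] for r in conn.execute(
        "SELECT DISTINCT source FROM accident WHERE part_of_body = 'HAND'")])
    # ['KNIFE', 'SAW']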

There has been limited research concerned with improving search effectiveness within relational databases themselves. For example, Liu, Yu, Meng, and Chowdhury

(2006) proposed an IR-style ranking strategy for effective keyword search in relational databases which returned answers with basic semantics obtained from the relational structure. They argued that, even though the major Relational Database Management

Systems (RDBMS) have provided full-text search capabilities, effectively retrieving information from these relational databases still requires knowledge of the database schema or structured query language (SQL), which is too complicated for ordinary users. However, by applying keyword search techniques such as IR-style ranking strategies in relational databases, ordinary users could retrieve relevant documents more effectively by submitting keywords. A lyrics database with information related to songs, converted from the data collected from a lyrics Web site, was used for their experiment. The lyrics database contained five tables: Table Artist with a column

Name, Table Album with a column Title, and Table Song with two columns Title and Lyrics. Tables Artist-Album and Song-Album were the corresponding relationship tables for Table Artist and Table Album, and Table Song and Table Album, respectively. A set of fifty lyrics queries was randomly selected for evaluation from a subset of a one-week user query log from a commercial search engine. For each query, a set of relevant answers was identified prior to the experiment. The average precision was used to measure the overall effectiveness. They submitted each of the fifty lyrics queries to

Google and compared their results (from their relational database) against those obtained

(from the Web) by Google. They reported that their experimental result was “16.3% better than Google” in effectiveness (p. 565). They acknowledged that Google’s databases are different from theirs and noted that “although Google retrieves from text databases (not from relational databases), its big success necessitates a comparison” (p.

565).

Ranganathan and Liu (2006) also focused on improving IR in relational databases.

They developed a semantic query system, Minerva, that “extends relational databases with the ability to answer semantic queries” (p. 820) by applying various domain

knowledge contained in various ontologies, in addition to the database’s schema, to return semantically relevant results to a user’s query. The system took user queries as the input for a SPARQL query (i.e., a Semantic Web query language) and returned “three types of semantically relevant results: direct results, inferred results and related results”

(p. 820). The direct results were obtained directly from the database tables. The inferred results were generated through a description logic reasoning subsystem. The related results were obtained based on the data in the database and the pre-defined similarity of concepts in the ontologies. For example, the direct results for querying disaster aid workers in New Orleans would list aid workers explicitly listed in New Orleans in the database. The inferred results would return aid workers listed in towns in the states of

Louisiana, Mississippi, Alabama, and other states in the Gulf Coast region (assuming this knowledge, i.e., states in the Gulf Coast region, was contained in the ontologies). The related results would return aid workers in New Orleans and in nearby towns in the state of Louisiana. The system appeared promising. However, the evaluation of the performance of the system indicated that the time to initialize the system and to answer the queries increased linearly with the size of the database, and therefore no further results were reported.

Although this Minerva system encountered issues with scalability of the system, both Minerva and SAOA-1 are intended to enhance IR by applying semantics and domain-specific knowledge associated with knowledge represented in a relational

(tabular) format. While Minerva applied certain Semantic Web technologies (i.e., ontologies and SPARQL query language) to enhance IR in a relational database, SAOA-1

automatically extracts semantics and domain-specific knowledge from a relational database on the Web to enhance IR on the Web.

Databases existing on the Web may be constructed based on relational models but may or may not use relational database technology to process queries. The database used by SAOA-1 is one of the databases administrated by OSHA. OSHA provides such a database based on a relational model for the general public to search for information related to occupational accidents (“Statistics & Data,” 2008). In particular, the Accident

Investigation Search database is based on such a relational model and is available for searching OSHA’s accident inspection results (“Statistics & Data,” 2008).

2.2.2 Character String Models

In the preceding section, we noted that relational databases contain domain- specific semantic knowledge that can be used to assist with IR, either in the relational database itself or as a step in retrieving more general information from the Web. Before discussing this use of relational knowledge, we first review concepts applied more generally by current search engines.

2.2.2.1 Boolean Models

The classic Boolean model is based on set theory. The Boolean logic uses the

Boolean operations AND, OR, and NOT to construct queries. Proximity operations such as phrase searching can be used to specify the lexical distance between terms. In phrase searching, terms are constructed in a specified order. Exact matching returns documents that meet all the query criteria. A strict Boolean-based implementation assumes either a

document is relevant (matching all) or not relevant (not matching all). Extended Boolean models add relevance judgments by assigning weights (0 to 1) to document terms.

Documents are retrieved based on relevance values. A minimum threshold value may be set and only those documents with a relevance value over the threshold may be retrieved

(Wolfram, 2003).
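To illustrate the extended Boolean idea just described, the following sketch scores documents by averaging the weights of the query terms they contain and retrieves only those above a minimum threshold; the documents, weights, threshold, and the averaging rule itself are illustrative assumptions, not a prescribed formulation:

    # A minimal sketch of the extended Boolean idea described above:
    # document terms carry weights in [0, 1], a document's relevance value
    # is taken here as the average weight of the query terms it contains,
    # and only documents above a minimum threshold are retrieved.
    docs = {
        "doc1": {"hand": 0.9, "cut": 0.8, "safety": 0.2},
        "doc2": {"hand": 0.3, "glove": 0.7},
        "doc3": {"cut": 0.1},
    }

    def relevance(doc_terms, query_terms):
        # Average the weights of the query terms (0 for absent terms) --
        # a soft relaxation of strict Boolean AND matching.
        return sum(doc_terms.get(t, 0.0) for t in query_terms) / len(query_terms)

    query = ["hand", "cut"]
    threshold = 0.5
    retrieved = {d: round(relevance(t, query), 2)
                 for d, t in docs.items() if relevance(t, query) >= threshold}
    print(retrieved)  # {'doc1': 0.85}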

2.2.2.2 Stemming

English words have various forms, with suffixes like -s, -es, -ed, -ing, -ation, etc. Stemming is the process of removing affixes, i.e., prefixes and suffixes, to reduce a word to a root form. Using the stem form of a word in a query increases recall because the stem will match all of the word’s variant forms in a document. For example, the stem “analy” will occur more often than any of its variant forms such as analysis, analyze, analyzes, analyzed, analyzing, analyzer, etc. (Frakes, 1992a; Salton & McGill,

1983). However, stemming may also generate irrelevant words. For example, a query for reformation may return reformatories (reform schools) (Berry & Browne, 2005).

Stemming can be done in various ways. Frakes (1992b) describes four approaches for automatic stemming: affix removal, successor variety, table lookup, and n-gram; and discusses Porter’s algorithm, i.e., morphological analysis, in detail. Porter (2006) provides encodings of the Porter Stemming Algorithm (Porter, 1980) for public use.
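The sketch below illustrates the affix-removal idea with a toy suffix-stripper; it is deliberately cruder than the Porter algorithm cited above, and the suffix list and minimum-stem-length check are arbitrary choices for illustration:

    # A toy affix-removal stemmer, for illustration only -- the Porter
    # algorithm cited above applies several ordered phases of context-
    # sensitive rules and handles far more cases than this sketch.
    SUFFIXES = ["ation", "izing", "ing", "ized", "es", "ed", "er", "s"]

    def crude_stem(word):
        # Try longer suffixes first, and keep at least a 3-letter stem.
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    for w in ["analyzes", "analyzed", "analyzing", "analyzer"]:
        print(w, "->", crude_stem(w))  # all four reduce to "analyz"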

2.2.3 Statistical Models

Statistical models are models that use the statistical distribution of terms, i.e., the frequencies of occurrence or co-occurrence of terms in documents to infer importance or possible semantic relationships (Broder & Henzinger, 2002; van Rijsbergen, 1979).

The frequency of occurrence of English words in a document can be characterized by Zipf’s distribution: frequency * rank ≈ constant (Salton & McGill, 1983). That is, by arranging distinct words in decreasing order of frequency of occurrence, with the most frequent words first, “the frequency of a given word multiplied by the rank order of that word will be approximately equal to the frequency of another word multiplied by its rank” (Salton & McGill, 1983, p. 60). Furthermore, Luhn (1958) pointed out that the frequency of occurrence of a word in a document may be used as an indicator of significance. The idea, as described by Broder and Henzinger (2002), is that: a. “the more documents the term appears in, the less ’characteristic’ the term is for the document”, and b. “the more often the term appears in the document, the more

‘characteristic’ the term is for the document,” assuming stop-words, like “a,” “the,” etc. were excluded (p. 4). Thus, a list of these frequent words, i.e., “keywords,” may be used to represent or characterize the document. Moreover, the degree of significance indicated by these keywords provides a simple weighting scheme to represent the document (van

Rijsbergen, 1979).
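These two observations are the intuition behind the term-weighting scheme now commonly known as tf-idf (a label not used in the passage above); a minimal sketch over a toy corpus follows:

    import math

    # A minimal tf-idf sketch of the two observations above: a term's weight
    # in a document rises with its in-document frequency (tf) and falls with
    # the number of documents that contain it (idf). The corpus is a toy.
    corpus = {
        "d1": "hand cut hand injury".split(),
        "d2": "hand glove safety".split(),
        "d3": "forklift accident injury".split(),
    }

    def tf_idf(term, doc_id):
        tf = corpus[doc_id].count(term)
        df = sum(1 for words in corpus.values() if term in words)
        idf = math.log(len(corpus) / df) if df else 0.0
        return tf * idf

    # "hand" occurs twice in d1 but also appears in d2, so its idf is modest;
    # "cut" appears only in d1, making it more "characteristic" of d1.
    print(round(tf_idf("hand", "d1"), 3))  # 0.811
    print(round(tf_idf("cut", "d1"), 3))   # 1.099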

Co-occurrence of words provides an indication of significance and relationships between words as well. Luhn (1958) also stated that “the more often certain words are found in each other’s company within a sentence, the more significance may be attributed to each of these words” (p. 160). Wolfram (2003) noted that “a high level of co-

occurrence between two terms provides an indication of a relationship between the two terms” (p. 106). In addition, “words which occur frequently with a given word may be thought of as forming a ‘neighborhood’ of that word” (Guthrie, Guthrie, Wilks, &

Aidinejad, 1991, p. 146). These semantic neighborhoods may be used to clarify word-sense ambiguities in natural language. For example, if the word virus co-occurs with the word worm, this helps to determine that worm is being used in the sense of “malicious computer program”, while if the word caterpillar co-occurs with worm, this helps to determine that worm is being used in the sense of “earthworm”

(Weeds & Weir, 2005). Sparck Jones studied the frequency of co-occurrence of associated keywords and showed that “such related words can be used effectively to improve recall, that is, to increase the proportion of the relevant documents which are retrieved” (van Rijsbergen, 1979, p. 5). Guthrie, Guthrie, Wilks, and Aidinejad (1991) suggested that semantic neighborhoods based on the co-occurrence of words for a given subject domain could be used for query reformulation to improve recall and precision in

IR.

Applications of the statistical distribution of words in IR systems include automatic indexing and automatic generation of thesauri (Wolfram, 2003), as well as statistical models such as the vector space model, latent semantic indexing models, and probabilistic models (Ding, 2005). The vector space model is a well-known IR model.

The latent semantic indexing model is another statistical model building on the vector space model. Probabilistic models, based on probability theory, are briefly discussed in this section as well.

2.2.3.1 Vector Space Model

The vector space model (VSM) was developed by Salton and his colleagues and was first used in the SMART (System for the Mechanical Analysis and Retrieval of Text)

IR system. In a vector space model, terms and documents are represented as vectors.

“Each vector component (or dimension) is used to reflect the importance of the corresponding term/concept/class in representing the semantics or meaning of a document.” (Berry & Browne, 2005, p. 4). In the term-by-document matrices, most of the components would be zeros, because each document often contains only a small subset of terms (in the term vector used to represent the document) given a large number of documents and a large number of terms. A particular threshold value (such as 0.5) of the similarity metric used for IR, such as the cosine coefficient, between queries and documents can be used to determine the number of documents to be retrieved (Berry & Browne, 2005;

Salton & McGill, 1983; Wolfram, 2003).

An example of applying the VSM for IR from Berry and Browne (2005, pp. 30-34) is adapted for demonstration purposes as follows.

a. Construction of the document-by-term matrix

The document titles:

D1: Infant & Toddler First Aid
D2: Babies & Children's Room (For Your Home)
D3: Child Safety at Home
D4: Your Baby's Health and Safety: From Infant to Toddler
D5: Baby Proofing Basics
D6: Your Guide to Easy Rust Proofing
D7: Beanie Babies Collector's Guide

The terms:

T1: Bab(y, ies, y's)
T2: Child(ren's)
T3: Guide
T4: Health
T5: Home
T6: Infant
T7: Proofing
T8: Safety
T9: Toddler

The 7 x 9 document-by-term matrix (Note that the elements are the term weights for terms assigned to the document.):

0 0 0 0 0 1 0 0 1   1 1 0 0 1 0 0 0 0   0 1 0 0 1 0 0 1 0 DOC 7x9 = 1 0 0 1 0 1 0 1 1   1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0   1 0 1 0 0 0 0 0 0

b. The query:

For someone who is interested in looking for information on Child Home

Safety from this small collection, the query vector would be:

$$\mathrm{QUERY}_{1 \times 9} = (0\ 1\ 0\ 0\ 1\ 0\ 0\ 1\ 0)$$

c. The similarity (cosine coefficient):

The similarity, i.e., cosine coefficient, between the query and the

document was defined as (Salton & McGill, 1983, p. 121):

$$\mathrm{COSINE}(\mathrm{DOC}_i, \mathrm{QUERY}_j) = \frac{\sum_{k=1}^{t} \mathrm{TERM}_{ik} \cdot \mathrm{QTERM}_{jk}}{\sqrt{\sum_{k=1}^{t} (\mathrm{TERM}_{ik})^2} \cdot \sqrt{\sum_{k=1}^{t} (\mathrm{QTERM}_{jk})^2}}$$

where DOC_i is a vector of TERM_i1, TERM_i2, ..., TERM_it; QUERY_j is a vector of QTERM_j1, QTERM_j2, ..., QTERM_jt; and TERM_ik and QTERM_jk are the weights, or importance, of term k assigned to DOC_i and QUERY_j, respectively.

The cosine coefficients are 0.0000, 0.6667, 1.0000, 0.2582, 0.0000,

0.0000, and 0.0000 for each of the seven documents, respectively.

d. The retrieval:

Document 2 (Babies & Children's Room (For Your Home)), document 3 (Child Safety at Home), and document 4 (Your Baby's Health and Safety: From Infant to Toddler) are relevant. However, based on a 0.5 threshold value, only documents 2 and 3 will be retrieved.
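The computation above is easy to reproduce; the following sketch recomputes the cosine coefficients for the Berry and Browne example directly from the 7 × 9 matrix and the query vector:

    import math

    # Reproducing the Berry & Browne (2005) example above: rows are the
    # seven document vectors over terms T1..T9; the query is Child Home Safety.
    DOC = [
        [0, 0, 0, 0, 0, 1, 0, 0, 1],  # D1: Infant & Toddler First Aid
        [1, 1, 0, 0, 1, 0, 0, 0, 0],  # D2: Babies & Children's Room
        [0, 1, 0, 0, 1, 0, 0, 1, 0],  # D3: Child Safety at Home
        [1, 0, 0, 1, 0, 1, 0, 1, 1],  # D4: Your Baby's Health and Safety
        [1, 0, 0, 0, 0, 0, 1, 0, 0],  # D5: Baby Proofing Basics
        [0, 0, 1, 0, 0, 0, 1, 0, 0],  # D6: Your Guide to Easy Rust Proofing
        [1, 0, 1, 0, 0, 0, 0, 0, 0],  # D7: Beanie Babies Collector's Guide
    ]
    QUERY = [0, 1, 0, 0, 1, 0, 0, 1, 0]

    def cosine(doc, query):
        dot = sum(d * q for d, q in zip(doc, query))
        norm = (math.sqrt(sum(d * d for d in doc))
                * math.sqrt(sum(q * q for q in query)))
        return dot / norm if norm else 0.0

    scores = [round(cosine(d, QUERY), 4) for d in DOC]
    print(scores)  # [0.0, 0.6667, 1.0, 0.2582, 0.0, 0.0, 0.0]

    # With the 0.5 threshold, only documents 2 and 3 are retrieved.
    print([i + 1 for i, s in enumerate(scores) if s >= 0.5])  # [2, 3]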

2.2.3.2 Latent Semantic Indexing Model

The latent semantic indexing (LSI) model proposed by Deerwester, Dumais,

Furnas, Landauer, and Harshman (1990) suggests that, because of the underlying latent semantic structure in the data, the relevance of a document can be based on the similarity of concepts instead of the similarity of terms as used in the vector space model. A document could be relevant and retrieved by LSI even if it does not contain any of the terms in the query. The key for LSI is to model the underlying associations and

15 relationships of the terms and the documents. In LSI, the initial term-document space is formed as in the vector space model and queries are formed as pseudo-document vectors.

Through a process of singular value decomposition (SVD), the dimension of the initial term-document space is reduced to a much smaller semantic concept-document space. By reducing the dimension of the spaces, LSI may improve recall and precision in IR (Berry

& Murray, 2005; Chen, Zeng, & Tokuda, 2006; Deerwester, Dumais, Furnas, Landauer,

& Harshman, 1990; Ding, 2005; Efron, 2005; Papadimitriou, Raghavan, Tamaki, &

Vempala, 1998; Wolfram, 2003).
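A minimal sketch of this reduction follows, reusing the Berry and Browne matrix from the previous section and an arbitrary choice of k = 2 retained dimensions (real LSI systems choose k empirically and operate on weighted, much larger matrices):

    import numpy as np

    # A minimal LSI sketch: factor the term-document matrix with SVD, keep
    # only the k largest singular values, and compare documents and queries
    # in the reduced "concept" space. Matrix reused from the VSM example
    # (transposed to terms x documents); k = 2 is arbitrary, for illustration.
    A = np.array([
        [0, 0, 0, 0, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 1, 0, 0, 1, 0],
        [1, 0, 0, 1, 0, 1, 0, 1, 1],
        [1, 0, 0, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 1, 0, 0],
        [1, 0, 1, 0, 0, 0, 0, 0, 0],
    ]).T  # 9 terms x 7 documents

    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :]

    # Documents in concept space are the columns of diag(sk) @ Vk; a query
    # is folded in as a pseudo-document: q_hat = diag(1/sk) @ Uk.T @ q.
    q = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
    q_hat = np.diag(1.0 / sk) @ Uk.T @ q
    doc_concepts = np.diag(sk) @ Vk

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Documents that are similar in *concept* usage can now score high even
    # if they share no literal terms with the query.
    print([round(cos(q_hat, doc_concepts[:, j]), 3) for j in range(7)])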

2.2.3.3 Probabilistic Models

Probabilistic models compute a similarity coefficient (or term weighting) between a query and a document. Probability theory is used to determine the likelihood that a document is relevant to a query. Documents are ranked in decreasing order of the evaluated probability of relevance to the query (Baeza-Yates & Ribeiro-Neto, 1999; Chowdhury, 2004; Grossman & Frieder, 2004; Korfhage, 1997; Meadow, Boyce, Kraft, & Barry, 2007; Wolfram, 2003).

“According to Doyle (1975, p.267), Maron and Kuhns (1960) were the first to describe in the open literature the use of association (statistical co-occurrence) of index terms as a means of enlarging and sharpening the search” (van Rijsbergen, 1979, p. 108).

Maron and Kuhns (1960) reported a “Probabilistic Indexing” technique “for literature indexing and searching in a mechanized library system” (p. 216). This Probabilistic

Indexing “allows a computing machine, given a request for information, to make a statistical inference and derive a number (called the ‘relevance number’) for each

16 document, which is a measure of the probability that the document will satisfy the given request” (p. 216). The “result of a search is an ordered list of those documents which satisfy the request ranked according to their probable relevance” (p. 216).

Robertson and Sparck Jones (1976) developed a theoretical framework (i.e., a probabilistic theory of relevance weighting) and derived a series of relevance weighting functions based on the assumptions that terms occur independently (i.e., no term co-occurrence information) and on probability ordering principles, given binary document descriptions (i.e., relevant or non-relevant). The model “later became known as the binary independence retrieval (BIR) model” (Baeza-Yates & Ribeiro-Neto, 1999, p. 30).

Salton and McGill’s approach for a probabilistic model is that “if estimates for the probability of occurrence of various terms in relevant documents can be calculated, then the probabilities that a document will be retrieved, given that it is relevant, or that it is not, can be estimated” (Chowdhury, 2004, p. 175). Their approach (adopted from Chowdhury,

2004, pp. 174-176; Salton & McGill, 1983, pp. 94-99) was based on the following parameters:

1. P(Rel): the probability of relevance of a record, i.e., a document

2. P(NotRel): the probability of non-relevance of a document, where P(NotRel) = 1 – P(Rel)

3. LOSS_1: a loss parameter associated with the retrieval of a non-relevant document

4. LOSS_2: a loss parameter associated with the non-retrieval of a relevant document

The loss of retrieving a non-relevant document is LOSS_1 · (1 – P(Rel)) and the loss of rejecting a relevant document is LOSS_2 · P(Rel). Therefore, the total loss is minimized if a document is retrieved whenever LOSS_2 · P(Rel) ≥ LOSS_1 · (1 – P(Rel)).

Equivalently, a discriminant function DISC may be defined as

$$\mathrm{DISC} = \frac{P(\mathrm{Rel})}{1 - P(\mathrm{Rel})} - \frac{\mathrm{LOSS}_1}{\mathrm{LOSS}_2}$$

and a document may be retrieved whenever DISC is greater than or equal to zero.

Based on Bayes’ theorem, i.e.,

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$

given a document Doc, the probabilities of relevance and non-relevance of the document, P(Rel|Doc) and P(NotRel|Doc), respectively, are:

$$P(\mathrm{Rel} \mid \mathrm{Doc}) = \frac{P(\mathrm{Doc} \mid \mathrm{Rel})\, P(\mathrm{Rel})}{P(\mathrm{Doc})}$$

and

$$P(\mathrm{NotRel} \mid \mathrm{Doc}) = \frac{P(\mathrm{Doc} \mid \mathrm{NotRel})\, P(\mathrm{NotRel})}{P(\mathrm{Doc})}$$

Assuming the two loss parameters are equal to 1 (i.e., LOSS_1 = LOSS_2 = 1), the retrieval rule calls for retrieval whenever P(Rel|Doc) ≥ P(NotRel|Doc), or whenever the discriminant function DISC ≥ 1, where

$$\mathrm{DISC} = \frac{P(\mathrm{Rel} \mid \mathrm{Doc})}{P(\mathrm{NotRel} \mid \mathrm{Doc})} = \frac{P(\mathrm{Doc} \mid \mathrm{Rel})\, P(\mathrm{Rel})}{P(\mathrm{Doc} \mid \mathrm{NotRel})\, P(\mathrm{NotRel})}$$
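Under the further (naive) assumption that terms occur independently given relevance or non-relevance, so that P(Doc|Rel) factors into per-term probabilities, the discriminant can be computed directly; all probability estimates in the sketch below are made up for illustration:

    # A minimal sketch of the retrieval rule above, with the common extra
    # assumption that terms occur independently given (non-)relevance, so
    # P(Doc|Rel) factors into per-term probabilities. All numbers here are
    # made-up estimates, not from the dissertation.
    p_rel = 0.1                      # P(Rel): prior probability of relevance
    p_term_given_rel = {"hand": 0.6, "cut": 0.5, "the": 0.9}
    p_term_given_notrel = {"hand": 0.1, "cut": 0.05, "the": 0.9}

    def disc(doc_terms):
        # DISC = [P(Doc|Rel) * P(Rel)] / [P(Doc|NotRel) * P(NotRel)]
        ratio = p_rel / (1.0 - p_rel)
        for t in doc_terms:
            ratio *= p_term_given_rel[t] / p_term_given_notrel[t]
        return ratio

    # Retrieve whenever DISC >= 1 (equal loss parameters, as in the text).
    for doc in [["hand", "cut"], ["the"]]:
        d = disc(doc)
        print(doc, round(d, 3), "retrieve" if d >= 1 else "reject")
    # ['hand', 'cut'] 6.667 retrieve
    # ['the'] 0.111 reject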

Other probabilistic models including the Darmstadt indexing approach (Fuhr &

Buckley, 1991), the staged logistic regression model (Cooper, Gey, & Dabney, 1992), the probabilistic inference model (Wong & Yao, 1995), the n-Poisson indexing models

(Margulis, 1992), and (Bayesian) inference network models (Turtle & Croft, 1991), were identified in the survey on probabilistic IR models by Crestani, Lalmas, van Rijsbergen, and Campbell (1998).

2.2.4 Models Based on Browsing

McDonald and Chen (2006) stated that “page-level context information may be helpful to the user when making relevance decisions” (p. 112). They argued that query-based searches retrieve documents matching the query but neglect the greater context of the documents. Compared with searching, a different strategy, namely browsing, is

“particularly effective for information problems that are ill-defined or interdisciplinary”

(Marchionini, 1995, p. 100). As a more informal and opportunistic means of information seeking, browsing models allow information seekers to navigate, scan, scroll, etc., to explore documents (i.e., nodes) that may help them fill their information needs, or “to better define their information needs or to modify their needs in a direction they may not have considered” (Wolfram, 2003, p. 20). In addition, in a hypertext system, browsing allows nonlinear access to documents, provides more interaction between the information seekers and the system, and leads to “a wider variety of cycles within the overall process” during browsing (Marchionini, 1995, p. 101). However, in a huge system such as the Web, browsing a large number of potentially relevant documents may become tiresome, and information seekers may become disoriented moving from one document to another (Wolfram, 2003). Thus, there is still a need to “better understand browsing as an information-seeking strategy” and to design interfaces that “support seamless browsing across and within documents” (Marchionini, 1995, pp. 100-101).

Early research conducted by Geller and Lesk (1983) studied whether “users prefer selection from a menu or specification of keywords to retrieve documents” (p. 130).

They tried two experiments. One used an on-line library catalog system and the other used an on-line news wire. Both systems allowed users to access information by keyword

searches or by menu selections. Their results showed that keyword searches (79% of the total searches) were preferred over the menu selections (21% of the total searches) in the library experiment and menu browsing of stories was 50% more common than keyword searches in the news wire experiment. They suggested that “the difference is based on the degree of user foreknowledge of the data base and its organization” (p. 130). For library users who might have a particular book in mind, searching with keywords would be straightforward while browsing menus of a complex library database would be time-consuming. For news wire users who most likely did not know what was in the news,

“menu-type interfaces tell the user what is available” and browsing menus in a relatively small daily news database would be manageable (p. 130).

While accessing large online data collections such as the World Health

Organization’s Statistical Information System (WHOSIS, http://www.who.int/whosis/),

Tanin, Shneiderman, and Xie (2006) argued that the typical menus or form fill-in interfaces that rarely gave users information about the contents and distributions of data often led users to waste time and network resources with zero- or mega-hit results. They suggested that “the designers of such interfaces generally assume that users are informed about the data and that they want to submit known-item queries rather than probing the data” (p. 403). Therefore, “a more effective, simple, and easy to learn approach for defining queries is needed for public online databases” (p. 403). They developed a prototype system, ExpO, that applied a two-phase querying strategy (i.e., a previews phase and a refinement phase) prior to submittal of the final query to reduce the resource load for the final query. In the first phase, the generalized query preview displayed the number of potential results for each keyword submitted by the user along with the

associated schema and attributes used in the database. In the second phase, users could refine the query based on this information to avoid zero- or mega-hit results. Large data collections, including the census data hosted by the United States Census Bureau

(http://www.census.gov/), the National Aeronautics and Space Administration’s Global

Change Master Directory (http://gcmd.gsfc.nasa.gov/), and the Internet Movie Database

(http://www.imdb.com), were used for their experiments. Their results showed that the

ExpO system was preferred and performed faster than form fill-in interfaces when tasks with vague query definitions were used during their experiments. They concluded that

“for exploratory querying tasks, generalized query previews can speed user performance for certain user domains and can reduce network/server load” (p. 402).

McDonald and Chen (2006) stated that the Internet lacks generic summaries and suggested that Web search engines “could provide various types of summaries based on the user’s need given the information-seeking task being performed” (p. 112). They developed a generic summarizer, the Arizona (AZ) Summarizer, and two extensions: AZ full-sentence, hybrid summarizer and AZ snippet, query-based summarizer. “The generic summarizer used a blend of discourse information and information obtained through traditional surface-level analysis. The query-based summarizer used only query-term information, and the hybrid summarizer used some discourse information along with query-term information.” (p. 111). In other words, the generic summarizer analyzed the context of a document and ranked full sentences to generate summaries. The snippet summarizer created a summary using the query terms in the document and the adjacent characters of text on each side of the query term. They evaluated the summaries using two types of information seeking tasks: browsing and searching. The results from 297

subjects showed that, given the browsing tasks, “the generic summary guided users to relevant documents 51% of the time” compared to 33% for the hybrid summary and 29% for the snippet summary (p. 131). Given the search tasks, the query-based summary guided users to relevant documents 36% of the time compared to 32% and 27% for the hybrid and generic summaries, respectively. “Overall, users were slightly more successful using the summaries for browsing tasks (37% accuracy) than in search tasks

(34% accuracy)” (p. 131). They concluded that generic summaries helped users to better select relevant documents in browsing.

All three studies discussed strategies that can be used in browsing models to enhance IR. In summary, whether users decide to retrieve information by browsing a menu or by specifying keyword searches may depend on the degree of the users’ foreknowledge of the database and its organization (Geller & Lesk, 1983). A preview menu with which the user could browse information regarding the contents and distributions of potential results allowed users to refine the query prior to submitting the final query to avoid zero- or mega-hit results, and also reduced wasted time and use of network resources. This strategy was especially useful while accessing large online data collections (Tanin, Shneiderman, & Xie, 2006). McDonald and Chen (2006) suggested that a generic summary helped users to better select relevant documents for browsing tasks, while a query-based summary helped users to better select relevant documents for search tasks.

2.2.5 Semantic Representations

Above, we have briefly reviewed a number of IR techniques that can be found in current operational systems. None of these techniques makes use of an explicit semantic representation. However, VSMs, LSI, and probabilistic methods all attempt to provide surrogates for searches based on the true semantic relationships among concepts and on semantic representations of document contents. All of these methods are potentially complementary to a system that makes use of semantically-based search techniques, in the sense that hybrid IR techniques can be developed that make use of several methods in order to overcome the limitations of any single method, whether semantically-based or not.

Below, we discuss semantically-based search techniques that have been developed. This includes a discussion of the use of thesauri, as well as concepts that have been developed under the label of the Semantic Web.

2.2.5.1 Thesauri

A thesaurus represents a knowledge base that contains a list of terms, each consisting of either a single word or a phrase, along with relationships among them such as equivalence, horizontal, hierarchical, non-hierarchical, and other relationships based on the different types of classifications.

WordNet (http://wordnet.princeton.edu/) is a very large lexical thesaurus for the

English language. It provides information about relationships among words. Examples of some semantic relationships constructed in WordNet include antonymy, entailment, holonymy, hypernymy, hyponymy, meronymy, and troponymy (“Glossary of,” 2006). A concise definition and examples of each of these terms are composed from WordNet and Answers.com (http://www.answers.com) as follows:

antonymy – Opposite; “The semantic relation that holds between two words that can (in a given context) express opposite meanings” (“antonymy, WordNet,” 2008). For example, “the word ‘wet’ is an antonym of the word ‘dry’” (“Results for antonym,” 2008).

entailment – “Something that is inferred (deduced or entailed or implied).” For example, “His resignation had political implications” (“entailment, WordNet,” 2008).

holonymy – “Whole to part relation;” “The semantic relation that holds between a whole and its parts” (“holonymy, WordNet,” 2008). For example, ‘hand’ is a holonym of ‘finger’. “Holonymy is the opposite of meronymy” (“Results for holonymy,” 2008).

hypernymy – “Superordination;” “The semantic relation of being superordinate or belonging to a higher rank or class” (“hypernymy, WordNet,” 2008). For example, ‘vehicle’ is a hypernym of the words ‘airplane’, ‘automobile’, ‘chariot’, and ‘train’ (“Results for hypernym,” 2008). Hypernymy is the opposite of hyponymy.

hyponymy – “Subordination;” “The semantic relation of being subordinate or belonging to a lower rank or class” (“hyponymy, WordNet,” 2008). For example, “‘red’, ‘white’, ‘blue’, etc. are hyponyms of ‘colour’” (“Results for hyponym,” 2008). Hyponymy is the opposite of hypernymy.

meronymy – “Part to whole relation;” “The semantic relation that holds between a part and the whole” (“meronymy, WordNet,” 2008). For example, “‘finger’ is a meronym of ‘hand’” (“Results for meronymy,” 2008). Meronymy is the opposite of holonymy.

troponymy – “The semantic relation of being a manner of doing something.” For example, “‘march’ is a troponym of ‘walk’” (“troponymy, WordNet,” 2008).
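Such relations can also be queried programmatically; for example, the following sketch uses the NLTK interface to WordNet (this assumes the nltk package and its WordNet corpus are installed, e.g. via nltk.download("wordnet"), and the exact synsets returned depend on the WordNet version):

    # A sketch of querying WordNet relations programmatically via NLTK.
    from nltk.corpus import wordnet as wn

    hand = wn.synset("hand.n.01")       # the body-part sense of "hand"
    print(hand.hypernyms())              # superordinate concepts (hypernymy)
    print(hand.part_meronyms())          # parts of a hand, e.g. finger (meronymy)
    print(wn.synset("finger.n.01").part_holonyms())  # wholes containing a finger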

Other well-known thesauri include the Education Resources Information Center (ERIC) Thesaurus (http://www.eric.ed.gov/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel=Thesaurus), an online digital library of education research and information sponsored by the Institute of Education Sciences of the U.S. Department of Education (“About ERIC,” 2008), and the Dublin Core metadata element set provided by the Dublin Core Metadata Initiative (DCMI, http://dublincore.org/).

A thesaurus can be used as a “classification structure” or a “controlled vocabulary” that helps in coordinating indexing and retrieval (Rada & Martin, 1987). It can also be used to broaden query terms, thus increasing recall (Salton & McGill, 1983).

Finally, a thesaurus can be used to support browsing, assisting the information seeker in broadening or narrowing his or her topic or query, or in identifying some related topic of interest (Chowdhury, 2004; Rada & Martin, 1987; Salton & McGill, 1983; Srinivasan,

1992).

Thesauri can be constructed manually or automatically. Manually constructed thesauri include general purpose and word-based thesauri such as Roget’s thesaurus

(http://thesaurus.reference.com/), WordNet, etc., and IR-oriented and phrase-based thesauri such as the Library of Congress Subject Headings (LCSH, http://www.loc.gov/library/libarch-thesauri.html), Medical Subject Headings (MeSH, http://www.nlm.nih.gov/mesh/), etc. (Jing & Croft, 1994). Manual thesauri must be updated periodically to include new terminology and are very expensive to build, as they require careful construction by human experts (Araujo & Perez-Aguera, 2006; Jing

& Croft, 1994).

Automatically generated thesauri are usually collection-dependent (i.e., "dependent on the text database which is used") (Jing & Croft, 1994, p. 147). According to Jing and Croft (1994), "research related to automatic thesauri dates back to Sparck Jones's work on automatic term classification, Salton's work on automatic thesaurus construction and query expansion, and van Rijsbergen's work on term co-occurrence" (p. 147). Jing and Croft (1994) developed a program to automatically construct collection-dependent association thesauri using large full-text document collections. Such automatically generated thesauri rely on the detection of different patterns of statistical co-occurrence for character strings (words) in a document collection, however, and are thus surrogates for thesauri based on true semantic relationships.
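
To illustrate the general idea of a collection-dependent association thesaurus, the following is a minimal sketch that ranks terms by their co-occurrence across documents, scored here with the Dice coefficient. It illustrates the co-occurrence approach only; it is not Jing and Croft's (1994) actual algorithm.

    # Co-occurrence-based association thesaurus (illustrative sketch).
    from collections import Counter
    from itertools import combinations

    def association_thesaurus(documents, top_n=5):
        """documents: list of token lists; returns term -> associated terms."""
        term_freq, pair_freq = Counter(), Counter()
        for doc in documents:
            terms = set(doc)
            term_freq.update(terms)
            pair_freq.update(combinations(sorted(terms), 2))
        related = {}
        for (a, b), co in pair_freq.items():
            score = 2.0 * co / (term_freq[a] + term_freq[b])  # Dice coefficient
            related.setdefault(a, []).append((score, b))
            related.setdefault(b, []).append((score, a))
        return {t: [w for _, w in sorted(pairs, reverse=True)[:top_n]]
                for t, pairs in related.items()}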

Below, we review a number of studies that described and evaluated different approaches for developing and using thesauri to enhance IR. These studies illustrate a range of ad hoc techniques to construct or extend thesauri, and a variety of techniques for using these thesauri to improve IR.

Study 1. Foo, Hui, Lim, and Hui (2000) proposed a process to automatically generate a Chinese thesaurus capable of providing related terms for users' queries. The thesaurus was generated based on the co-occurrence of domain-specific terms found in a document collection. The MG (Managing Gigabytes) system, an existing English-language IR system developed by Witten et al. (1994, as cited in Foo, Hui, Lim, & Hui, 2000, p. 235), was adapted to generate the Chinese thesaurus and modified to support IR with the generated thesaurus. Articles in a Chinese newspaper were used as the corpus to evaluate the system. Users were able to select additional terms suggested by the thesaurus and re-run the query to generate a new set of results. They confirmed that such an automatically generated thesaurus was applicable for Chinese IR and that the use of the thesaurus significantly improved recall, on average by 17.2% and 19.5% for their two approaches.

Study 2. Mandala, Tokunaga, and Tanaka (2000) suggested that combining a set of thesauri would provide a valuable resource for query expansion, since each thesaurus often has different characteristics. They proposed a method to improve IR by using heterogeneous thesauri for query expansion. The expansion terms were taken from both a manually constructed thesaurus (WordNet) and automatically generated thesauri (a co-occurrence-based thesaurus and a predicate-argument-based thesaurus, i.e., words with grammatical relations such as subject-verb, verb-object, and adjective-noun relations), based on the average of the similarities between terms from the three types of thesauri. The Text REtrieval Conference 7 (TREC-7) text collection was used for the experiments. The results supported their hypothesis that "the use of the combined set of thesauri gives better retrieval results than just using one type of thesaurus" (p. 377). Testing on the documents (including the Title, Description, and Narrative) provided by the TREC-7 collection, the average precision was improved by 30.2% when the combined set of thesauri (WordNet, the predicate-argument-based, and the co-occurrence-based thesaurus) was used. When only one thesaurus was used, precision was improved by 1.7%, 11.5%, and 18.1% for WordNet, the predicate-argument-based, and the co-occurrence-based thesaurus, respectively. When pair-wise thesauri were used, precision was improved by 15.2%, 22.6%, and 29.8% for the pairs WordNet and the predicate-argument-based thesaurus, WordNet and the co-occurrence-based thesaurus, and the predicate-argument-based and the co-occurrence-based thesauri, respectively.
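
The core of the combination idea, averaging a candidate term's similarity to a query term across several thesauri, can be sketched as follows. This is a minimal illustration under the assumption that each thesaurus exposes a similarity function; it is not the authors' implementation.

    # Averaging term similarities across heterogeneous thesauri (sketch).
    def expansion_terms(query_term, candidates, similarity_fns, top_n=5):
        """similarity_fns: one similarity(query_term, cand) per thesaurus."""
        scored = []
        for cand in candidates:
            avg = (sum(fn(query_term, cand) for fn in similarity_fns)
                   / len(similarity_fns))
            scored.append((avg, cand))
        return [c for _, c in sorted(scored, reverse=True)[:top_n]]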

The Text REtrieval Conference (TREC, http://trec.nist.gov/) series, co-sponsored by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense (DOD), was started in 1992 as part of the first phase of the TIPSTER Text Program, which was led by the Defense Advanced Research Projects Agency (DARPA) to advance text processing technologies through cooperation among government, industry, and academia. The purpose of TREC was "to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies" ("Overview," 2007). TREC provides large test collections in a laboratory environment to support large-scale evaluation of IR using conventional measures based on recall and precision ("Overview," 2007; Sparck Jones, 2000; "TIPSTER Text," 2001). TREC "has succeeded in standardizing and validating the use of test collections as a research tool for ad hoc retrieval, and has extended the use of test collections to other tasks" (Buckley & Voorhees, 2005, p. 53).

Study 3. Varelas, Voutsakis, Raftopoulou, Petrakis, and Milios (2005) argued that classical IR models such as vector space models may fail to retrieve documents with semantically similar terms, because these models are based on lexicographic term matching and semantically similar terms may be lexicographically different. They proposed an IR model, the Semantic Similarity Retrieval Model (SSRM), that was able to "detect similarities between documents containing a semantically similar but not necessarily lexicographically similar term" (p. 10). First, the model computed weights for the original query terms based on their frequency of occurrence in the document collection. Second, the query was augmented with semantically similar terms (i.e., synonyms, and then hyponyms and hypernyms) from WordNet with similarity greater than a threshold value of 0.9. These semantically similar terms, along with the original query terms, were then re-weighted, and the terms with higher similarities (i.e., greater than the 0.9 threshold) were used to expand the original query. They reported that "SSRM is far more effective than VSM, achieving up to 30% better precision and up to 20% better recall" (p. 15). They concluded that "the efficiency of SSRM is mostly due to the contribution of non-identical but semantically similar terms" (p. 15), which is ignored by most classical retrieval models that rely on lexicographic term matching, such as VSM.
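
The flavor of SSRM's expansion step can be sketched as follows: gather a term's WordNet neighbors and keep those whose similarity to it exceeds a threshold. This minimal sketch uses NLTK's wup_similarity as a stand-in for SSRM's own similarity measure; it is not the authors' implementation.

    # Threshold-based WordNet expansion in the spirit of SSRM (sketch).
    from nltk.corpus import wordnet as wn

    def expand_term(term, threshold=0.9):
        expanded = set()
        for synset in wn.synsets(term):
            expanded.update(l.name() for l in synset.lemmas())   # synonyms
            for other in synset.hypernyms() + synset.hyponyms():
                sim = synset.wup_similarity(other)
                if sim is not None and sim >= threshold:
                    expanded.update(l.name() for l in other.lemmas())
        return expanded - {term}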

Study 4. Araujo and Perez-Aguera (2006) generated an initial thesaurus for a particular domain of knowledge and then applied semantically related terms extracted from an online dictionary to enrich this initial thesaurus for IR. The core set of the initial thesaurus consisted of the nouns, taken from the index terms of a text collection, that characterized the intended domain. The initial thesaurus was then constructed from three source thesauri, EUROVOC, SPINES, and ISOC, where Eurovoc (http://europa.eu/eurovoc/) is a multilingual thesaurus covering controlled vocabularies in the documentation systems of the European Union ("Eurovoc Thesaurus," 2007), SPINES is a controlled and structured vocabulary for policy-making, management and development in the field of science and technology, and ISOC is a thesaurus "aimed at the treatment of information on economy" (p. 271). They analyzed the output (i.e., the definitions) of a Spanish online dictionary, RAE (Real Academia Española), and observed that semantic relationships between terms were often expressed by hierarchy patterns such as "Of or related to NOUN" and "Of or relating to ARTICLE NOUN" (p. 269). For example, if the entry was chemical, the definition contained "Of or relating to chemistry"; if the entry was numerical, the definition contained "Of or relating to a number". They used this pattern matching technique to extract semantically related terms and associated hierarchical relationships from RAE to enrich their initial thesaurus. One of the document collections provided by the Cross-Language Evaluation Forum (CLEF, http://www.clef-campaign.org/) for Spanish was used to evaluate IR. They reported that precision and recall were improved by 12.9% and 16.8%, respectively, with the enriched thesaurus.
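
The pattern-matching idea can be sketched with a simple regular expression; the English pattern below is for illustration only, since the authors' extraction ran over Spanish RAE definitions.

    # Extracting a related noun from dictionary definitions (sketch).
    import re

    PATTERN = re.compile(r"Of or relat(?:ed|ing) to (?:a |an |the )?(\w+)", re.I)

    def related_term(definition):
        match = PATTERN.search(definition)
        return match.group(1).lower() if match else None

    print(related_term("Of or relating to chemistry"))  # -> 'chemistry'
    print(related_term("Of or relating to a number"))   # -> 'number'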

Study 5. Imai, Collier, and Tsujii (1999) argued that even though several query expansion methods existed, "few studies have explored the combination of these approaches which use different knowledge sources" (p. 292). They experimented with two traditional query expansion methods, a similarity-thesaurus-based method and a local feedback method, applied in sequence to reformulate queries. The similarity was defined as the sum of the weighted relevance of a thesaurus term to each term in the query. The query was first expanded by adding the top n conceptually similar terms from the similarity-thesaurus-based method. The query was then expanded by "adding weight of terms in the top ranked n relevant documents and reducing the weight of terms in last m documents of the initial retrieval" (p. 292) from the local feedback method. The new query thus included the original query terms, the relevant terms from the thesaurus, and the terms from the top n documents, and excluded the terms from the last-ranked m documents. Each set of terms was given a preset weight. The MED test collection, containing 1033 MEDLINE abstracts and 30 queries, was used for evaluation. MEDLINE (Medical Literature Analysis and Retrieval System Online) is the National Library of Medicine's (NLM, http://www.nlm.nih.gov/) premier online medical bibliographic database, containing millions of references to journal articles in the life sciences with a concentration on biomedicine ("MEDLINE," 2008). Retrieval performance was measured by average precision. In their experiments, the last 40 documents were used as the non-relevant documents. The results showed that the average precision increased 27.2% when 60 terms were added by the thesaurus method and the 15 top-ranked documents were used by the local feedback procedure. They suggested some indication that "the performance of the combined expansion methods is better than that of applying each method by itself" (p. 292).
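
The combination of the two expansion methods can be sketched as follows; the weights and data structures are illustrative assumptions, not the authors' actual values.

    # Thesaurus expansion combined with local feedback (illustrative sketch).
    def combined_expansion(query_terms, thesaurus_terms, top_docs, bottom_docs,
                           w_thes=0.5, w_top=0.3, w_bottom=0.2):
        weights = {t: 1.0 for t in query_terms}        # original query terms
        for t in thesaurus_terms:                      # similarity thesaurus
            weights[t] = weights.get(t, 0.0) + w_thes
        for doc in top_docs:                           # boost top-n doc terms
            for t in doc:
                weights[t] = weights.get(t, 0.0) + w_top
        for doc in bottom_docs:                        # penalize last-m doc terms
            for t in doc:
                weights[t] = weights.get(t, 0.0) - w_bottom
        return {t: w for t, w in weights.items() if w > 0}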

Study 6. Tudhope, Binding, Blocks, and Cunliffe (2006) developed a semantic expansion engine, including a semantic closeness algorithm and a matching function, to assist query expansion in indexed collections. The thesaurus was the Art and Architecture Thesaurus (AAT, http://www.getty.edu/research/conducting_research/vocabularies/aat/) and the collection was that of the UK National Museum of Science and Industry (http://www.nmsi.ac.uk/). The closeness (or distance) of two concepts was measured by the minimum number of semantic relationships traversed to connect the two concepts, and the boundary of the expansion was set by a configurable cut-off threshold. The matching function assigned matches between a query term and a semantically close index term based on the cost factor associated with each traversal. The system was presented, but no empirical results were provided regarding improvement of IR.
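
The closeness computation can be sketched as a cheapest-cost traversal over thesaurus relationships, stopping at a cut-off; the relation names and cost values below are assumptions for illustration, not the authors' configuration.

    # Semantic closeness by cheapest traversal with a cut-off (sketch).
    from collections import deque

    def expand(start, relations, costs, cutoff=1.0):
        """relations: term -> [(related_term, relation_type)];
        costs: relation_type -> traversal cost; returns term -> total cost."""
        reached = {start: 0.0}
        queue = deque([start])
        while queue:
            term = queue.popleft()
            for neighbor, rel in relations.get(term, []):
                cost = reached[term] + costs[rel]
                if cost <= cutoff and cost < reached.get(neighbor, float("inf")):
                    reached[neighbor] = cost
                    queue.append(neighbor)
        return reached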

Study 7. Riekert (2002) pointed out that full-text retrieval, short of semantic interpretation of search terms, is not sufficient for all application areas, and that many users might prefer to browse rather than search in meta-information systems such as the German Environmental Information Network of the German Federal Environment Ministry. He developed a prototype of a thesaurus-based, catalog-navigating IR system, Thesaurus Navigator, and presented examples of the effective use of the knowledge stored in thesauri for automatic search for information on the Internet. The Environmental Thesaurus of the German Federal Environmental Agency (GFEA) was used as an information source to generate a navigable catalog. The catalog was generated based on three semantic relationships: a) the "used-for" relationship among synonyms, b) the hierarchical relationship between broader and narrower terms, and c) the linkage between related terms within the controlled vocabularies. Users were able to navigate the hierarchical thesaurus catalog generated from the relevant terms in the Environmental Thesaurus. The search query was then reformulated by the Java-based Thesaurus Navigator prior to being transmitted to the default search engine, AltaVista (http://www.altavista.com/), through the HyperText Transfer Protocol (HTTP). He concluded that, based on the reactions of the test users, a large number of users would appreciate an interaction mode of navigation on a catalog system as an additional option. However, there was no formal evaluation of the IR system.

Study 8. Greenberg (2001) pointed out that "while researchers have explored the value of structured thesauri as controlled vocabularies for general information retrieval (IR) activities, they have not identified the optimal query expansion (QE) processing methods for taking advantage of the semantic encoding underlying the terminology in these tools" (p. 487). She examined "whether QE via semantically encoded thesauri terminology is more effective in the automatic or interactive processing environment" (p. 487). The results showed that for narrower terms, the mean incremental relative recall for automatic (system-controlled) QE was 5.4% greater than that for interactive (end-user-controlled) QE, a statistically significant difference (p < 0.05). For related terms, the mean precision for interactive QE was 21.4% greater than that for automatic QE, also statistically significant (p < 0.05). The major conclusions were that "narrower terms are generally good candidates for automatic QE because they increased relative recall with a loss in precision that was not significant", and that "related terms are better candidates for interactive QE because they increased relative recall via both processes, but the decrease in precision was less than with automatic QE" (p. 496).

Study 9. Chen, Ho, and Yang (2006) sought to improve the relationships between the categories and subcategories in Web catalogs by applying hierarchical structures for catalog integration. They proposed "an enhanced hierarchical catalog integration (EHCI) approach with conceptual relationships extracted from the source catalog thesaurus" to improve the accuracy of Web catalog integration (p. 635). The semantic concepts in the source catalog were extracted, and the conceptual relationships between layered categories were assigned based on a thesaurus weighting scheme. Documents under five different categories were integrated from Yahoo into Google and from Google into Yahoo to evaluate the integration accuracy. Their figure showed that the integration accuracy ranged between 60% and 85% with this semantic approach (i.e., applying the conceptual relationships in the thesaurus), compared to a range between 50% and 80% for the baseline cases.

Thesauri Summary

These papers serve to highlight several key concepts regarding the use of semantic representations (specifically thesauri) to support IR. These concepts include questions about:

1. Whether semantic representations can be used to improve IR;

2. When and how such representations should be used to automatically expand queries or to support browsing so that the user can control query refinement.

Generally speaking, these previous findings indicate that semantic representations, with various techniques for constructing and using thesauri, can be used to improve IR. The reported range of improved recall was from 16.8% to 20% (Studies 1, 3, and 4) and the reported range of improved precision was from 12.9% to 30.2% (Studies 2, 3, 4, and 5). The test collections included newspapers (Study 1), Web pages (Study 3), and other collections such as CLEF (Study 4), MEDLINE abstracts (Study 5), and TREC (Study 2). Note that the goal of presenting these ranges is to give a sense of the levels of improvement in IR; it is not intended to compare the results among these studies, since the designs of the experiments varied. Four other studies (Studies 6, 7, 8, and 9), however, did not report any empirical data on the improvement of recall or precision.

These studies have demonstrated various techniques for constructing and using thesauri to enhance IR. The methods for selecting thesaurus terms included query expansion based on co-occurrence of terms (Studies 1 and 2), closeness of terms (Study 6), pattern matching (Study 4), similarity of terms (Studies 3, 5, and 8), a thesaurus weighting scheme (Study 9), semantic relationships of terms (Study 7), and user interactions (Studies 7 and 8). The thesauri were constructed from different sources such as the document space (Studies 1, 5, and 9), an online dictionary (Study 4), and existing thesauri such as AAT (Study 6), the Environmental Thesaurus of GFEA (Study 7), EUROVOC, SPINES, and ISOC (Study 4), the ProQuest Thesaurus (Study 8), and WordNet (Studies 2 and 3).

2.2.5.2 Semantic Web

The Semantic Web aims to "make use of information embedded in the documents primarily for the purpose of helping the computer understand the content so that it can do a better job of finding just the information needed by a user" (Meadow, Boyce, Kraft, & Barry, 2007, p. 206). Where HyperText Markup Language (HTML) assigns tags to control the display of information, eXtensible Markup Language (XML), a foundational tool of the Semantic Web, assigns tags to identify the content of information. In addition to XML, the Resource Description Framework (RDF), an application of XML, is also a key technology in developing the Semantic Web to exchange machine-understandable information on the Web. "In RDF, a document makes assertions that particular things (people, Web pages or whatever) have properties (such as 'is a sister of,' 'is the author of') with certain values (another person, another Web page)" (Berners-Lee, Hendler, & Lassila, 2001). The third basic component of the Semantic Web is ontologies. An ontology, as used in the fields of artificial intelligence and Web search, is "a document or file that formally defines the relations among terms" (Berners-Lee, Hendler, & Lassila, 2001). The Web Ontology Language (OWL) is designed for processing the content of information and facilitates greater machine interpretability of Web content than that supported by XML and RDF by providing additional vocabulary along with a formal semantics (Berners-Lee, Hendler, & Lassila, 2001; Meadow, Boyce, Kraft, & Barry, 2007; Rubin, 2004; "OWL Web," 2004).
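
To make the RDF notion of assertions concrete, the following is a minimal sketch that builds two such statements with the rdflib Python library. The people and documents are made up for illustration; FOAF is a real, widely used RDF vocabulary.

    # RDF assertions as (thing, property, value) triples (illustrative).
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF

    g = Graph()
    alice = URIRef("http://example.org/people/alice")
    page = URIRef("http://example.org/docs/report")

    g.add((alice, FOAF.made, page))             # alice 'is the author of' page
    g.add((alice, FOAF.name, Literal("Alice")))
    print(g.serialize(format="turtle"))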

Researchers have been developing IR systems for the Semantic Web, and applications in different domains, to utilize the capabilities of the Semantic Web.

Shah, Finin, Joshi, Cost, and Mayfield (2002) argued that "current information retrieval techniques are unable to exploit the semantic knowledge within documents and hence cannot give precise answers to precise questions" and claimed that "indexing text and semantic markup together will significantly improve retrieval performance" (p. 461). By including semantic markup as indexing terms, they developed an integrated system, OWLIR (Ontology Web Language and Information Retrieval), to extract (i.e., IR) and reason (i.e., simple and complex question answering) about contents from text documents and semantically marked-up Web pages for better precision. The TREC evaluation package was used to evaluate the system. The average precision was 25.9%, 55.2%, and 85.5% for three different types of documents: text only (i.e., unstructured data), text with semantic markup (i.e., unstructured and structured data), and text with semantic markup that has been augmented by inference (i.e., unstructured, structured, and inferred data). (The inference was done by a) "reasoning over ontology instances (e.g., deriving the Date and Location of a Basketball Game); and b) reasoning over ontology hierarchy (e.g., a Basketball Game Event is a subclass of Sport Event)" (p. 466).) They concluded that "semantic markup contained in documents and additional information from external sources, begets higher precision" (p. 467).

Ding, Finin, Joshi, Pan, Cost, Peng, et al. (2004) stated that current Web search engines like Google and AlltheWeb (http://www.alltheweb.com/) are "designed to work with natural languages and expect documents to contain unstructured text composed of words" (p. 657). Therefore, these search engines do not work well with Semantic Web Documents (SWDs), which are "characterized by semantic annotation and meaningful references to other SWDs" (p. 652). They developed Swoogle, "a crawler-based indexing and retrieval system for the Semantic Web" (p. 652). The system was designed to "automatically discover SWDs, indexes their metadata and answers queries about it" (p. 652). "It runs multiple crawlers to discover SWDs through meta-search and link-following; analyzes SWDs and produces metadata about SWDs and the relations among SWDs; computes ranks of SWDs using a 'rational random surfing model'; and indexes SWDs using an information retrieval system by treating URIrefs as terms" (p. 658). The system had "analyzed over 137,000 semantic web documents with 3,900,000 triples" and was expected to discover more SWDs for its database (p. 658).

Yu, Mine, and Amamiya (2005) suggested that "with semantic Web technology, Web information can be given well-defined meaning which can be understood and processed by machines" (p. 974). They proposed "a conceptual architecture for a personal semantic Web information retrieval system" (p. 974). The system "incorporates semantic Web, Web services and multi-agent technologies to enable not only precise location of Web resources but also the automatic or semi-automatic integration of hybrid Web contents and Web services" (p. 974).

Cheung, Yip, Smith, deKnikker, Masiar, and Gerstein (2005) applied Semantic Web techniques to integrating data in the life sciences domain. They developed a prototype Web-based application, YeastHub, to demonstrate that a data warehouse in the life science domain could be built by using "a native RDF database system (Sesame) to store and query diverse types of yeast genome data across multiple sources" (p. i95). "Once the data are loaded into the data warehouse, RDF-based queries can be formulated to retrieve and query the data in an integrated fashion" (p. i85).

Wu, Yu, and Chen (2008) adopted the Semantic Web as a solution for Traditional Chinese Medicine (TCM) integration, management, and utilization. A set of tools and systems, including semantic graph mining, was developed for collective intelligence. They presented a Semantic Web platform that "demonstrates the Semantic Web's ability to connect data from interrelated domains for interdisciplinary research, and contributes to the preservation and modernization of TCM as intangible cultural heritage" (p. 1086).

2.3 Post-Retrieval Processing

Ranking and refinement are two common tasks performed after relevant documents have been retrieved, to further match the information seeker's needs.

2.3.1 Ranking

Ranking algorithms are used to determine the potential relevance of documents to the submitted queries. Boolean systems provide powerful search capabilities for trained intermediaries such as librarians, but not necessarily for typical end-users, especially those who are not familiar with Boolean systems (Harman, 1992a). Harman (1992a) states that ranking approaches allow the user "to input a simple query such as a sentence or a phrase (no Boolean connectors) and retrieve a list of documents ranked in order of likely relevance" (p. 363).

Harman (1992a) presented a survey of ranking algorithms, their experiments, and implementations. Ranking methods have been based on term-weighting methods, such as how often the terms occur in a document (e.g., twenty times vs. two times); on document structure, such as where the terms occur (e.g., in the title, summary paragraphs, or text body); or on combinations of the two (Harman, 1992a). Ranking methods used by some search engines are based on hyperlink analysis, such as PageRank (i.e., ranking by the number of links pointing to a document), which is used by Google.

Google is currently the most popular search engine on the Web ("Nielson online," 2008). At the core of Google's search is the PageRank algorithm proposed by Brin and Page (1998). Brin and Page borrowed the concept of academic bibliographic citations and viewed hyperlinks as citations or votes. They assumed that there is a "random surfer" who starts at one Web page and then moves randomly to another page. PageRank is the probability that the random surfer visits a page. The PageRank of a page is a measure of link popularity as a function of the page's links, i.e., inlinks and outlinks. A link from page A to page B is interpreted as casting a vote, by page A, for page B. The more votes a page receives, the higher its PageRank. In other words, PageRank is based on the number of pages that link to a given page (Brin & Page, 1998; "Google searches," 2008).
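
The random-surfer computation can be sketched with the standard power-iteration formulation below. This is a minimal illustration on a toy link graph, not Google's production algorithm.

    # PageRank by power iteration over a toy link graph (illustrative).
    def pagerank(links, damping=0.85, iterations=50):
        """links: page -> list of pages it links to."""
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                targets = outlinks or pages      # dangling page: spread evenly
                for t in targets:
                    new_rank[t] += damping * rank[page] / len(targets)
            rank = new_rank
        return rank

    print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))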

2.3.2 Refinements

The initially retrieved results may need to be refined to meet the information seeker's needs. This can be done by searching for related terms in the initially retrieved sets to further match the query. The process may go through iterations depending on the queries and the algorithms used. Another approach is relevance feedback (e.g., a simple relevance feedback mechanism such as "more like this") or query modification (or reformulation) by adding broader or narrower terms, certain related terms, or additional or alternative query formulation possibilities (Harman, 1992b; Hearst, Elliott, English, Sinha, Swearingen, & Yee, 2002; Salton & McGill, 1983). The feedback mechanisms can be manual or automatic. In manual relevance feedback, "relevant documents are identified and new terms are selected either manually or automatically" (Grossman, 2001, p. 5). In automatic relevance feedback, "relevant documents are identified automatically by assuming the top-ranked documents are relevant and new terms are selected automatically" (Grossman, 2001, p. 5). Experimental work has shown that relevance feedback improves precision in retrieval. The information seeker is able to view the intermediate results, such as the titles and brief descriptions, and then to modify the original query and resubmit a more on-target query based on the intermediate results that appear to be relevant and/or irrelevant (Berry & Murray, 2005; Salton & McGill, 1983).
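
A classic formulation of this idea is Rocchio's relevance feedback formula; the following is a minimal sketch applying it to term-weight vectors represented as dictionaries. The alpha, beta, and gamma values are conventional illustrative choices, not values from the works cited above.

    # Rocchio-style relevance feedback over term-weight vectors (sketch).
    def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        terms = set(query)
        for doc in relevant + nonrelevant:
            terms |= set(doc)
        new_query = {}
        for t in terms:
            pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
            neg = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
            w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
            if w > 0:
                new_query[t] = w
        return new_query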

In addition, the queries used by information seekers may often be different from the terms used to index the documents. Furnas, Landauer, Gomez, and Dumais (1987) find that the probability of two people choosing the same word to describe the same concept is less than 0.20. Techniques for automatic query expansion have been used to address such vocabulary problems (Carpineto, Romano, & Giannini, 2002). These techniques can be categorized into global techniques and local techniques. Global techniques, such as LSI, rely on statistics from the whole corpus (collection). Local techniques, such as retrieval feedback, expand queries based on information from the set of top-ranked documents, on the assumption that the top-ranked documents are relevant. The problem with global techniques is that they are not as efficient and effective as some local techniques. The problem with local techniques is that the top-ranked documents they rely on may contain many irrelevant terms or may even be irrelevant (Xu & Croft, 2000). Also, the top-ranked documents may not contain information broad enough for the information seeker's needs. The use of a knowledge base for query modification could help overcome such vocabulary problems.

2.4 Search Assistance Provided by Search Engines

Commercial search engines are the primary tools used to retrieve information from the Internet. Transaction logs provided by search engines have been used to analyze searching behaviors on the Internet (Wolfram, 2008). Jansen and Pooch (2001) find that Web users, in each search session, submit few, short queries (about two terms per query), view ten or fewer retrieved documents, and rarely use Boolean operators or other advanced features provided by the search engine used to perform the searches. Rieh and Xie (2006) identify eight patterns of Web query reformulation and suggest the development of secondary search tools that support more complex query reformulation behaviors during searches. Many commercial search engines now provide search assistance, i.e., query suggestions, for users to refine or reformulate their search queries. White and Marchionini (2006) find that "real-time query expansion improved the quality of initial queries for both known-item and exploratory tasks, making it potentially useful during the initiation of a search, when searchers may be in most need of support" (p. 716).

Google, Yahoo, Live Search, and Ask all provide real-time search query suggestions but in different ways. The keywords “hand cut” and “hand” are used below as an example to demonstrate the differences in the real-time search query suggestions among these four well-known search engines.
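
Although each engine's suggestion service is proprietary, the basic as-you-type behavior can be sketched as filtering a popularity-ranked query log by the typed prefix. The log entries and counts below are invented for illustration.

    # Prefix-based query suggestion over a made-up query log (sketch).
    QUERY_LOG = {"hannah montana": 8800, "handbags": 3100,
                 "hand cut dovetails": 920, "hand cut crystal": 410}

    def suggest(prefix, limit=5):
        hits = [(count, q) for q, count in QUERY_LOG.items()
                if q.startswith(prefix.lower())]
        return [q for _, q in sorted(hits, reverse=True)[:limit]]

    print(suggest("hand"))   # the most popular queries beginning with "hand"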

Google and Live Search provide real-time search query suggestions, if any, when the search results are displayed in the results page. In our example, the search keywords "hand cut" did not trigger any search suggestions. However, if the search keyword "hand" was used, Google provided two sets of suggestions: "Related searches", displayed at the top of the search results, and "Searches related to", displayed at the bottom of the search results. Table 2.1 shows the list of Google's search suggestions (the "Related searches" and "Searches related to") when using the keyword "hand" in Google. Figure 2.1 shows a screen image of the real-time search assistance from Google's results page.

At Top, Related searches: hand pain, handbags, hand measurement, hand tremors

At Bottom, Searches related to: hand pain, handbags, hand measurement, hand tremors, hand numbness, broken hand

Table 2.1: Real-time search assistance by Google (Retrieved May 8, 2008)


Figure 2.1: A screen image of the real-time search assistance ("Related searches" and "Searches related to") from Google (Retrieved May 7, 2008)

Similar to Google, Live Search provided search query suggestions when using the search keyword "hand" but not "hand cut". Table 2.2 shows the "Related searches" for the keyword "hand" suggested by Live Search. The "Related searches" suggestions were located at the top right-hand corner of the results page (Figure 2.2).

At Right, Related searches: Pictures Of , Pictures Human Hands, Helping Hands, Hand Foot And Mouth Disease, Human Hand

Table 2.2: Real-time search assistance by Live Search (Retrieved May 8, 2008)

Unlike Google or Live Search, both Yahoo and Ask automatically suggest search queries and topics while the user is typing search keywords in the search box. In Yahoo, while typing the search keywords "hand cut", five search query suggestions were automatically displayed right below the search box as soon as "han" was keyed in. The suggested search queries changed as additional characters (letters, numbers, or other characters) were keyed in. When the search keyword "hand" was keyed in, another five search query suggestions were displayed; when the intended search keywords "hand cut" were completely keyed in, Yahoo generated another five search query suggestions. (In order to compare the suggested queries with Google and Live Search, the search keyword "hand," instead of "hand cut," is used for illustration purposes.) Figure 2.3 shows a screenshot of Yahoo's search homepage with the search keyword "hand" keyed in in the search box. Yahoo's Search Assist is located right below the search box, with the matched characters, i.e., "han" or "hand," highlighted.

Figure 2.2: A screen image of the real-time search assistance (“Related searches”) from Live Search (Retrieved May 8, 2008)


Figure 2.3: A screenshot of the real-time search assistance (“Search Assist”) from Yahoo (Retrieved May 8, 2008)

At the results page, Yahoo's Search Assist is hidden by default, but a list of search query suggestions labeled "Also try" is provided at the top and the bottom of the search results (Figure 2.4). The Search Assist can be exposed by clicking on the arrow-down tab below the search box, by clicking the More link at the end of the "Also try" list, or by keying in characters in the search box. Additional suggested search queries and concepts related to the search keywords are provided when the Search Assist is exposed; in the meantime, the "Also try" list at the top becomes hidden (Figure 2.5). The search queries and topics (concepts) suggested by Yahoo's Search Assist are based on "what all users in general tend to find helpful" ("What is," 2008). Table 2.3 lists the real-time alternative search queries and topics suggested by Yahoo's Search Assist.


Figure 2.4: A screen image of the real-time search assistance (“Also try”) from Yahoo (Retrieved May 8, 2008)


Figure 2.5: A screen image of the real-time search assistance (“Search Assist”) from Yahoo (Retrieved May 8, 2008)

Below Search Box, by keyed-in terms:
  "han": hannah montana, tom hanks, daryl hannah, hana kimi, miley cyrus hannah montana pictures
  "hand": coach handbags, handheld tips2008, chelsea handler, gucci handbags, chanel handbags
  "hand cut": 50 cent hand cut off, hand cut dovetails, hand cuts, hand cut crystal, hand cut cz

At Top and Bottom:
  Also try: hand guns, hand guns for sale, coach hand bags
  More: hand tools, free palm hand reading, hand trucks, poker hand rankings, gang hand signs, hand foot and mouth, ranch hand bumpers

Above Search Results:
  Search terms: handbags, handheld tips2008, hand guns, coach handbags, chelsea handler, hands, gucci handbags, chanel handbags, occupational outlook handbook, handling scrollevent through
  Explore concepts: American Society, Hand, the human hand, , hand + cards, wrist, anatomy, IMDB, Hand Injuries, human thumb, flexors and extensors, hand Corporation, compartment, flexor, metacarpal bones, Articulations

Table 2.3: Real-time search assistance by Yahoo (Retrieved May 8, 2008)

In Ask, while typing the search keywords "hand cut", the Search Suggestions automatically displayed a list of suggestions right below the search box as soon as "ha" was keyed in. Lists of Search Suggestions were also displayed, if any, as additional characters were keyed in. The search terms "ha", "han", and "hand" each produced a list of ten suggestions, while "hand cut" generated only two suggestions. Figure 2.6 shows a screenshot of Ask's Search Suggestions with the search keyword "hand" keyed in in the search box. At the results page (Figure 2.7), Ask has a different layout than Google, Yahoo, or Live Search. The search assistance is located at two different places: at the top of the search results, and in the left-hand column of the results page, both with the matched terms highlighted in bold face. When using the search keyword "hand", the search assistance at the top simply displayed "Narrow" followed by three search suggestions: Human Hand, Anatomy of Hand, and Drawing of a Hand. The search assistance at the left-hand side of the results page, however, provided quite a few alternative search terms to narrow or broaden the search: Narrow Your Search provided topics that are specifically related to the search, and Expand Your Search provided topics that are conceptually related to the search ("Site Feature," 2008). These search query suggestions were generated by Ask's ExpertRank algorithm based on popularity among pages on the topic ("Ask Search," 2008). Table 2.4 lists the real-time alternative search queries and topics suggested by Ask's Search Suggestions.

2.5 Summary

In summary, instead of improving search in relational databases (Liu, Yu, Meng, & Chowdhury, 2006; Ranganathan & Liu, 2006), we intend to enhance IR on the Internet. We briefly reviewed IR techniques, post-retrieval processing, and the search assistance provided by search engines, as these are used by the search engines we seek to enhance.

A new trend, namely the Semantic Web, was also discussed under the section on semantic representations. Conceptually, SAOA-1 is a semantically based search agent. However, it was not built upon the technologies of the Semantic Web (studies of Cheung, Yip, Smith, deKnikker, Masiar, & Gerstein, 2005; Ding, Finin, Joshi, Pan, Cost, Peng, et al., 2004; Shah, Finin, Joshi, Cost, & Mayfield, 2002; Yu, Mine, & Amamiya, 2005; and Wu, Yu, & Chen, 2008).


Figure 2.6: A screenshot of the real-time search assistance (“Search Suggestions”) from Ask (Retrieved May 8, 2008)

Figure 2.7: A screenshot of the real-time search assistance ("Narrow Your Search" and "Expand Your Search") from Ask (Retrieved May 8, 2008)

Below Search Box, by keyed-in terms:
  "ha": hannah montana, harry potter, halloween, hairstyles, hairstyles pictures, hair styles, hallmark, hawaii, haircuts, hampton inn
  "han": hannah montana, hannah montana songs, hannah montana lyrics, hannah montana music, hannah montana tickets, hannah montana cating, hancock fabrics, hanna montana, hannah montana concert tickets, hannah montanna
  "hand": handbags, hand foot and mouth disease, hands, handwriting analysis, handyman, handwriting, hand washing, hand painted nail art, handel, hand foot mouth disease
  "hand cut": hand cut dovetails, hand cut mat

At Top:
  Narrow: Human Hand, Anatomy of Hand, Drawing of a Hand

At Left:
  Narrow Your Search: Human Hand, Anatomy of Hand, Drawing of a Hand, Cartoon Hand, Holding Hands, Common Gang Hand Signs, Hand Bones, Hand Muscles, Hand Surgery, Open Hand
  More: Hand Washing, Reaching Hand, Draw a Hand, Helping Hand, Waving Hand, Hand Outline, Pointing Hand, Right Hand, Left Hand, Hand Diagram, Hand Clipart, Black Hand, Skeleton Hand, Hand Measurement, Hand Gestures, Hand Palm

  Expand Your Search: Eye, Foot, Nose, Finger
  More: Ear, Arm, Mouth, Fist, Leg, Heart, Head, Tongue, Handprint, Lips, Touch, Brain, Hair, Blow, House, Wrist, Sun, Sign Language, Thumbs up, Face, Elbow, Earth, Horse, Fire, Body, Hat, Thumb, Hammer, Nails, Baby

Table 2.4: Real-time search assistance by Ask (Retrieved May 8, 2008)


Therefore, the major emphasis of this literature review was on the use of thesauri, including their construction and application for enhancing searches, as discussed and summarized earlier in Section 2.2.5.1.

Currently, commercial search engines are the primary tools for retrieving information from the Internet. Published research recommends enhancements for searching: it analyzes transaction logs and user behaviors, identifies patterns of Web query reformulation, and suggests the use of secondary search tools (Jansen & Pooch, 2001; Rieh & Xie, 2006; White & Marchionini, 2006; Wolfram, 2008). Commercial search engines themselves also provide real-time assistance based on popular terms that are related to the query (Google, Live Search) or to other search results (Yahoo, Ask).

This information will be used as a guide for the design decisions of the SAOA-1 system.


CHAPTER 3

SOFTWARE DESIGN:

CONCEPTUAL DEVELOPMENT AND IMPLEMENTATION

3.1 Introduction

There is no doubt that information seekers would like to effectively retrieve the most relevant information on their topic of interest from the Internet. Commercial search engines like Google, Yahoo, and Live Search all attempt to achieve this goal for given queries based on their own algorithms and rankings. In general, however, these commercial search engines are not yet capable of reasoning about Web content or understanding queries semantically in order to retrieve the most relevant information more effectively (Chiang, Chua, & Storey, 2001).

Consider, for instance, an information seeker who intended to search for information about hand cuts related to occupational accidents. The information seeker might enter “hand” and “cut” as the search keywords.

The first ten search results (as of February 23, 2008) from Google, Yahoo, and Live Search are indeed related to some sense of "hand" and "cut", but none of those documents is relevant to occupational injuries, which is what the information seeker intended. Among the first ten results from each search engine, at least half of the documents are related to hand-cut dovetails and hand-cut crystals. Table 3.1 shows a summary of the content topics of the first ten documents from Google, Yahoo, and Live Search. Figures 3.1, 3.2, and 3.3 display screen images of the first ten search results for "hand" and "cut" from Google, Yahoo, and Live Search, respectively.

Thus, even though Google, Yahoo, and Live Search are all popular and powerful search tools, in many cases the search results do not match what the information seeker is looking for. This is because these search engines rely heavily on character string matches to retrieve documents. One complementary approach to improve a search in large collections is to take advantage of semantic representations in relational databases that have been developed for purposes other than to support information retrieval. That is the general approach taken in this study.

In a generic sense, we extract the semantics from such a relational database and use these semantic relationships to provide suggestions for refining a keyword (character string) search. Conceptually, the prototype of SAOA-1 is a search agent with embedded domain-specific knowledge that has been automatically extracted from a relational database. SAOA-1 monitors for keyword searches for which its expertise appears relevant and then suggests semantically based refinements of the information seeker’s query.

Google:
  Dovetails Cutting: Documents 2, 4, 5
  Crystal Gifts: Documents 1, 10
  Jigsaw Puzzles: Documents 6, 7 (wooden puzzles)
  Clothing Store: Document 3
  Family Medicine: Document 9 (discussion forum)
  Siding Products: Document 8 (hand-cut stone)

Yahoo:
  Crystal Gifts: Documents 4, 8, 9, 10
  Dovetails Cutting: Documents 2, 3
  Hand-Cut Coins: Documents 5, 7 (charms)
  Granite: Document 6 (hand-split)
  Wikipedia: Document 1

Live Search:
  Dovetails Cutting: Documents 1, 3, 7, 8, 10
  Crystal Gifts: Documents 4, 9
  Siding Products: Document 5 (hand-cut stone)
  Magic Tricks: Document 6 (handcutcoins.com)
  Wikipedia: Document 2

Table 3.1: A summary of the first ten search results for “hand” and “cut” from Google, Yahoo, and Live Search, respectively (Retrieved on February 23, 2008)


Figure 3.1: A screen image of the first ten search results for “hand” and “cut” from Google (Retrieved February 23, 2008)


Figure 3.2: A screen image of the first ten search results for “hand” and “cut” from Yahoo (Retrieved February 23, 2008)


Figure 3.3: A screen image of the first ten search results for “hand” and “cut” from Live Search (Retrieved February 23, 2008)

The rest of this chapter is organized as follows. In the next section, Section 3.2, we revisit the "hand" and "cut" example and present search results with the assistance of SAOA-1. Section 3.3 discusses how SAOA-1 achieves greater precision of search results by applying several steps, including the use of a semantic representation derived from OSHA's database. Section 3.4 describes the development of SAOA-1, including the preparation of the knowledge base. Section 3.5 explains the functionalities of SAOA-1 and its variant, SAOA-0.

3.2 Test Results with the Assistance of SAOA-1

The homepage of SAOA-1 (Figure 3.4) is a standard HyperText Markup Language (HTML) page with a simple design like that of many commercial search engines. A text box allows users to type in search keywords, and a "Submit" button submits the search query. A brief test was conducted on SAOA-1 using the search keywords "hand" and "cut", as in the example described in Section 3.1. Figure 3.5 shows a screen image of the first ten "Original Search" results from SAOA-1. Figure 3.6 displays a partially zoomed-in version of Figure 3.5 for readability.

As can be seen under the label Search Query on the left-hand side of the image, SAOA-1 generates three sets of information to help information seekers: the "Original Search", the "Alternative Search", and the "Suggested Refinements for Alternative Search".


Figure 3.4: A screenshot of the Homepage of SAOA-1


Figure 3.5: A screen image of the first ten search results for “hand” and “cut” from the Original Search (Retrieved February 28, 2008)


Figure 3.6: A screen image of the partially zoomed in version of Figure 3.5 for readability (Original Search)

The Original Search uses the original search keywords "hand" and "cut" to retrieve documents. Note that SAOA-1 uses Google to retrieve documents. The Original Search is therefore simply a Google search, because it retrieves documents using the original search keywords without applying additional knowledge. Thus, the documents retrieved using the Original Search should be the same as the documents retrieved using Google directly, had the documents been retrieved at the same time. The only difference would be the display of the interface.

A screen image of the first ten search results using the Alternative Search is displayed in Figure 3.7. Figure 3.8 shows a screen image of the partially zoomed in version of Figure 3.7 for readability.

Figure 3.9 shows a screen image of the list of Suggested Refinements for Alternative Search. Figure 3.10 shows the partially zoomed-in version of Figure 3.9 for readability. The two checked items, "Hand(s)" under Part of Body and "Cut/Laceration" under Nature of Injury, correspond to the keywords "hand" and "cut" in the original search query. Without any further selection of refinements (using the checkboxes), Figure 3.11 displays the new query that will be used in the Modified Search in this case.

Note that SAOA-1 monitors for search query inputs suggesting that the person is interested in some topic related to occupational accidents. If the original search had no triggering keywords, then SAOA-1 would not present its suggested alternative search and associated suggestions for refinements. Even if SAOA-1 was triggered and suggested using its Alternative Search, the user could decide SAOA-1 was simply wrong and ignore its suggestions.


Figure 3.7: A screen image of the first ten search results for “hand” and “cut” from Alternative Search (Retrieved February 28, 2008)


Figure 3.8: A screen image of the partially zoomed in version of Figure 3.7 for readability (Alternative Search)


Figure 3.9: A screen image of the list of Suggested refinements for Alternative Search

Figure 3.10: A screen image of the partially zoomed in version of Figure 3.9 for readability (Suggested refinements for Alternative Search)

Figure 3.11: A screenshot of the dialog box displaying the new query for Modified Search. (The dialog box appears when the Show Modified Search button is clicked.)

A screen image of the first ten search results using the Modified Search is displayed in Figure 3.12. Figure 3.13 shows a screen image of the partially zoomed in version of Figure 3.12 for readability.

All of the first ten documents from Original Search (Google Search) are totally irrelevant to the topic of occupational accidents. Seven out of the ten documents are related to wood products such as hand cut wooden puzzles and dovetails. The other three documents are related to hand cut crystal gifts, a clothing store with the name handcut, and hand-cut stones as siding materials.

The first ten documents from the Alternative Search, on the other hand, are all related to the topic of occupational accidents. Among these ten documents, seven are related to injuries due to cuts to the hands. The other three documents, Documents 4, 5, and 8, are related to the search keywords "hand" and "cut" but are not directly related to injuries due to hand cuts. Document 4, reporting occupational injuries involving wood chippers published by the Centers for Disease Control and Prevention, contains the phrases "keep hands and feet away" and "cutting bush" (Struttmann, 2004). Document 5, recording a list of cases involving logging incidents published by the National Institute for Occupational Safety and Health (NIOSH), contains the phrases "ranch hand" and "timber cutter" ("Logging Fatality," n.d.). Document 8, recording a summary of worker injuries in Switzerland published by the International Labour Office Bureau of Statistics, contains the phrase "hand tool" ("Switzerland," n.d.).

The first ten documents from the Modified Search are all related to occupational accidents as well. Among these ten documents, nine are related to injuries due to hand cuts. Document 2 is the only document that is not related to injuries caused by hand cuts.


Figure 3.12: A screen image of the first ten search results for “hand” and “cut” from Modified Search (Retrieved March 1, 2008)


Figure 3.13: A screen image of the partially zoomed in version of Figure 3.12 for readability (Modified Search)

Document 2, which discusses hiring conscientious employees to cut down on workplace accident costs, contains the phrases "on the other hand" and "cut down" (Haaland, 2006).

Table 3.2 shows a summary of the results, indicating the number of documents relevant to "occupational accidents" and to the query "hand" and "cut" within this topic (based on the author's judgment).

Original Search (Google Search): Occupational accidents: 0 relevant
  Totally irrelevant: 10
    Wood related: 7 (Documents 3, 4, 6, 7, 8, 9, 10; hand cut puzzle, dovetail, etc.)
    Crystal Gifts: 1 (Document 1; hand cut crystal)
    Clothing Store: 1 (Document 2; handcut)
    Siding Products: 1 (Document 5; hand-cut stone)

Alternative Search: Occupational accidents: 10 relevant
  Totally relevant: 7 (Documents 1, 2, 3, 6, 7, 9, 10; hand, cut)
  Partially relevant: 3 (Document 4: hands away/cutting bush; Document 5: ranch hand/timber cutter; Document 8: hand tool)

Modified Search: Occupational accidents: 10 relevant
  Totally relevant: 9 (Documents 1, 3, 4, 5, 6, 7, 8, 9, 10; hand, cut)
  Partially relevant: 1 (Document 2: on the other hand/cut down)

Table 3.2: A summary of the number of relevant documents in the first ten search results from Original Search, Alternative Search, and Modified Search, respectively (Retrieved March 1, 2008)

3.3 The Assistance of SAOA-1

While using the search keywords from the text box to perform the Original Search, SAOA-1 also evaluates whether its expertise could be relevant to helping the searcher, based on the search keywords as described earlier. If SAOA-1's expertise is not judged to be relevant, SAOA-1 simply retrieves and displays the documents without providing additional assistance (i.e., SAOA-1 remains invisible). But if SAOA-1 is triggered, it provides an Alternative Search and suggests refinements for the Alternative Search to help the information seeker.

The test results in Section 3.2 illustrated how both the Alternative Search and the Modified Search could achieve greater precision (Salton & McGill, 1983) than the Original Search (Google search) for the first ten retrievals (see Table 3.2). This was accomplished by applying several steps, including the use of a semantic representation derived from a government database that was constructed by OSHA for other purposes.

3.3.1 The OSHA Database

Among the abundant occupational safety and health information and tools OSHA provides, OSHA keeps "adequate records of all occupational accidents and illnesses for proper evaluation and necessary corrective action" according to the Occupational Safety and Health Act of 1970 ("Occupational Safety," 2004). One of its databases provides records entitled "Accident Investigation Summaries", which contain the results of accident investigations. The purpose of this database is not to assist with Internet searches but to "enable the user to search the text of the Accident Investigation Summaries (OSHA-170 Form) which result from OSHA accident inspections", as stated on OSHA's website ("Statistics & Data," 2008). OSHA's accident investigations encompass all public and private sectors in all industries in all 50 states. Information related to accidents, such as "Nature of Injury", "Part of Body", "Source of Injury", "Event Type", "Environmental Factor", and "Human Factor", is recorded in Items 10 to 15 of the Investigation Summaries Form, the OSHA-170 Form (APPENDIX A). The investigation summaries are accessible to the general public through the "Accident Investigation Search" interface (http://www.osha.gov/pls/imis/accidentsearch.html) provided by OSHA.

To complete the Investigation Summaries Form, OSHA also provides lists of controlled vocabulary terms belonging to these different categories: "Nature of Injury", "Part of Body", "Source of Injury", "Event Type", "Environmental Factor", and "Human Factor", in order to describe different injuries using a standard terminology. Table 3.3 shows the controlled vocabulary terms for the category "Part of Body". The complete list of these categories and the controlled vocabulary terms of each category is given in APPENDIX B ("Selected Occupational," 1991).

One of the tables contained in the Accident Investigation Summaries is the "Injury Line" table. The "Injury Line" contains information on each injury caused by an occupational accident, if an injury was incurred. This "Injury Line" includes specific information about the party injured and the injury itself, including the injured party's Age and Sex, the Nature of Injury (e.g., Cut/Laceration), Part of Body (e.g., Hand(s)), Source of Injury (e.g., Machine), Environmental Factor (e.g., Shear Point Action), Human Factor (e.g., Position Inappropriate For Task), and Degree of Injury (e.g., Hospitalized Injury), among other fields. Tables 3.4 and 3.5 show two sample tables from separate OSHA documents created using the Accident Investigation Summaries Form. Table 3.4 records a cut or laceration injury to hand(s) caused by a machine. Table 3.5 records a burn injury to hand(s) caused by heat on a working surface. In essence, the tables in Appendix B provide a semantic representation for characterizing occupational injuries, conceptualizing classes of injury, and associating concepts within each class. They also provide knowledge about known relationships among concepts across these classes (Tables 3.3, 3.4, and 3.5). Fundamentally, SAOA-1 re-purposes this knowledge to support information retrieval.
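
One way to see how such records carry re-purposable semantics is sketched below: co-occurrence of controlled-vocabulary values across Injury Line records links concepts in one class (e.g., Part of Body) to concepts in another (e.g., Nature of Injury). The records and the mining code are a minimal illustration only, not SAOA-1's actual extraction procedure.

    # Mining cross-class relationships from Injury Line records (sketch).
    from collections import Counter

    records = [
        {"Part of Body": "Hand(s)", "Nature of Injury": "Cut/Laceration",
         "Source of Injury": "Machine"},
        {"Part of Body": "Hand(s)", "Nature of Injury": "Burn/Scald(Heat)",
         "Source of Injury": "Working Surface"},
    ]

    def related_values(records, given_field, given_value, target_field):
        """Rank target_field values co-occurring with given_field=value."""
        counts = Counter(r[target_field] for r in records
                         if r.get(given_field) == given_value)
        return counts.most_common()

    print(related_values(records, "Part of Body", "Hand(s)", "Nature of Injury"))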

01 Abdomen
02 Arm(s) Multiple
03 Back
04 Body System
05 Chest
06 Ear(s)
07 Elbow(s)
08 Eye(s)
09 Face
10 Finger(s)
11 Foot/Feet/Toe(s)/Ankle(s)
12 Hand(s)
13 Head
14 Hip(s)
15 Knee(s)
16 Leg(s)
17 Lower Arm(s)
18 Lower Leg(s)
19 Multiple
20 Neck
21 Shoulder
22 Upper Arm(s)
23 Upper Leg(s)
24 Wrist(s)
25 Blood
26 Kidney
27 Liver
28 Lung
29 Nervous System
30 Reproductive System
31 Other Body System

Table 3.3: Controlled vocabulary for the semantic class “Part of Body” for OSHA’s Investigation Summaries Form (“Selected Occupational,” 1991) Source: OSHA, http://www.osha.gov/

Inspection Nr: 304947492
Investigation Nr: 201632320
Line Nr: 1
Age: 52
Sex: M
Nature of Injury: Cut/Laceration
Part of Body: Hand(S)
Source of Injury: Machine
Event Type: Caught In Or Between
Environmental Factor: Shear Point Action
Human Factor: Position Inapropriate For Task
Occupation: Butchers and meat cutters
Degree of Injury: Hospitalized injury
Task Assigned: Task regularly assigned

Table 3.4: Sample of Injury Line: Fred Meyer Stores, Inc. (“Fred Meyer,” 2001) Source: OSHA, http://www.osha.gov/

Inspection Nr: 303614655
Investigation Nr: 171060593
Line Nr: 1
Age: 19
Sex: F
Nature of Injury: Burn/Scald(Heat)
Part of Body: Hand(S)
Source of Injury: Working Surface
Event Type: Other
Environmental Factor: Work-Surface/Facil-Layout Cond
Human Factor: Misjudgment, Haz. Situation
Occupation: Supervisors, food preparation & service occupation
Degree of Injury: Non Hospitalized injury
Task Assigned: Task regularly assigned

Table 3.5: Sample of Injury Line: Sugarhouse Barbecue. (“Sugarhouse Barbecue,” 2001) Source: OSHA, http://www.osha.gov/

3.3.2 Search Agent Role in Refining Terms and Increasing Result Relevancy

Although the information seeker has a great deal of influence over the accuracy and pertinence of each search, and ultimately decides the resultant relevancy, a search agent can provide assistance to improve search performance. In this study, a five-step process is executed in which our search agent, SAOA-1, suggests search query refinements when a user enters a set of keywords.

Step 1: Activates upon keyword recognition (related to occupational accidents)

Step 2: Constructs an “Alternative Search” using search engine knowledge

Step 3: Applies semantics and knowledge extracted from the OSHA database

Step 4: Identifies conceptual relationships and focuses attention on related concepts

Step 5: Presents refinements to the user for the “Alternative Search”

To better understand how SAOA-1 carries out each function, each step is broken down below to explain the technical process in more detail.

Step 1: Activates upon keyword recognition (related to occupational accidents)

SAOA-1 mimics assistance provided by an expert human intermediary,

contributing helpful search information as would a librarian or research assistant.

As a topic-oriented search agent, SAOA-1 helps the information seeker refine and

identify relevant results when it recognizes a search topic it is familiar with. If the

user is looking for information related to a topic familiar to SAOA-1, it will assist

by contributing its domain knowledge. SAOA-1 is designed as a topic-oriented

search agent using “occupational accidents” as its topic of interest. Trigger terms associated with “occupational accidents” were carefully constructed and hard-coded into the program based on a combination of human input terms and the terms contained in tables (Table 3.3 and its equivalents for the other concept classes; see Appendix B) retrieved automatically from OSHA’s Accident

Investigation Summaries database. If one or more search keyword(s) match any of these trigger terms, SAOA-1 is activated and prepares information to assist the client. Currently, the trigger terms are constructed with the following syntactical and grammatical variants in mind:

a. Cognates and word variations (singular/plural forms):

Ex. Hand and hands are both trigger words for the concept hand. Cut and cuts are trigger words for the concept cut. Face, faces, and facial are trigger words for the concept face. Asphyxia, asphyxiation, and asphyxiate are trigger words for the concept asphyxia.

b. Synonyms (words of the same or nearly the same meaning):

Ex. Laceration will trigger the concept cut and vice versa.

c. Related Words/Term Subtopics (“is a kind of”):

Ex. Hand and palm are trigger words for the concept hand. Eye, eyelash, cornea, retina, and pupil are used as the trigger words for the concept eye.

The trigger words in the examples above are just a few samples of search terms used to trigger SAOA-1 and mainly pertain to specific types of bodily injury under the general umbrella of “occupational accidents”. When conducting a search on this topic, SAOA-1 recognizes the words business, career, employ, employee, employer, employment, job, job-related, occupation, occupational, office, OSHA, profession, professional, vocation, vocational, work, working, workplace, and work-place as trigger words for the concept “occupational”. Likewise, injury, injuries, accident, and trauma are recognized as trigger words for the concept “accidents”.

In summary, SAOA-1 is activated if one or more search keywords match any trigger words for the concepts “occupational” or “accidents” as listed above, or match any of the terms in the OSHA tables for Part of Body, Nature of Injury, or Source of Injury as shown in Appendix B.
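As a rough sketch of this activation check (not SAOA-1’s actual source code; the class name, set contents, and helper method below are illustrative only), the matching reduces to case-insensitive set membership over the query’s keywords:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class TriggerCheck {
        // Illustrative subsets of the hard-coded trigger terms described above.
        private static final Set<String> OCCUPATIONAL = new HashSet<String>(Arrays.asList(
                "business", "career", "employ", "job", "occupation", "occupational",
                "osha", "work", "working", "workplace"));
        private static final Set<String> ACCIDENTS = new HashSet<String>(Arrays.asList(
                "injury", "injuries", "accident", "trauma"));
        // Sample terms drawn from the OSHA tables
        // (Part of Body, Nature of Injury, Source of Injury).
        private static final Set<String> OSHA_TERMS = new HashSet<String>(Arrays.asList(
                "hand", "hands", "cut", "cuts", "laceration", "lacerations"));

        // SAOA-1-style activation: any keyword matching any trigger set activates the agent.
        static boolean isActivated(String query) {
            for (String keyword : query.toLowerCase().split("\\s+")) {
                if (OCCUPATIONAL.contains(keyword) || ACCIDENTS.contains(keyword)
                        || OSHA_TERMS.contains(keyword)) {
                    return true;
                }
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(isActivated("hand cut"));       // true
            System.out.println(isActivated("weather radar"));  // false
        }
    }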

Step 2: Constructs an “Alternative Search” using search engine knowledge

When SAOA-1 is activated, an alternative query is formed in the

background to assist users in defining a more focused query. This is particularly

geared toward users who apply basic search concepts and are not aware of

additional services provided by the search engine to help refine their terms. In

Google, for example, the order of search keywords affects the search results and a

synonym search feature is available (“Advanced Search,” 2008; “Google SOAP,”

2008; “The Essentials,” 2008).

Google states that, “Keep in mind that the order in which the terms are

typed will affect the search results.” (“The Essentials,” 2008). For example, the

search query “occupational accidents” returns 348,000 results and the search

query “accidents occupational” returns 615,000 results. Figure 3.14 shows a screenshot of the first ten Google-generated search results for the keywords

“occupational accidents”. Similarly, Figure 3.15 shows a Google screenshot of the first ten search results using keywords “accidents occupational”. Comparing these two sets of documents, only one resulting link appeared on both screenshots (No.

7 in Figure 3.14, No. 8 in Figure 3.15).

Note that Google claims that, “By default, Google only returns pages that include all of your search terms. There is no need to include "and" between terms”

(“The Essentials,” 2008). If the search keywords contain the term “and”, Google also displays a message, “The "AND" operator is unnecessary -- we include all search terms by default”, right below the search box as shown in Figure 3.16.

However, based on the search results shown in Figures 3.14 and 3.15, Google’s AND operator does not seem to behave as a pure logical operator. If it did, the search keywords “occupational accidents” would retrieve the same set of results as the search keywords “accidents occupational”. Even though it is not clear how Google’s “AND” operator functions in the background, this piece of search engine knowledge (the order of search keywords affects the search results) has been embedded as search engine expertise in SAOA-1.

The synonym search is an advanced search feature provided by Google.

Google’s synonym search investigates not only the search keywords submitted by the user, but also synonyms of those keywords. For example, if the search keyword is occupational, Google will also search for its synonyms


Figure 3.14: Google’s first ten search results of keywords “occupational accidents” (Retrieved February, 27, 2008)


Figure 3.15: Google’s first ten search results of keywords “accidents occupational” (Retrieved February, 27, 2008)


Figure 3.16: Google’s comments on the AND operator (Retrieved March 26, 2008)

including career, physical, and workplace. If the search keyword is accidents, Google will also search using injuries, crashes, and fatalities.

Google performs the synonym search when a tilde sign (“~”) is placed immediately in front of a search keyword. Keyword synonyms used by Google can be detected by submitting a synonym keyword search and then, by using the minus sign (“–”), excluding the keyword itself: “~keyword –keyword” (“Fun with,” 2003). This exclusion search using the minus sign (“–”) is another search feature provided by Google (“Advanced Search,” 2008).

Abstractly, adding keywords related to the topic of interest would retrieve more relevant documents. Using a synonym search would retrieve additional relevant documents and potentially increase recall (Salton & McGill, 1983). Thus, because SAOA-1 uses Google to retrieve documents, an alternative query is constructed based on these two aspects of search engine knowledge: the order of search keywords and the synonym search as described above.

Upon activation of SAOA-1, the alternative query is constructed to incorporate search engine knowledge. The alternative query expands the original query with the topic of interest and the capability of the synonym search. The topic of interest is blended with the original query by adding keywords related to the topic of interest, i.e., occupational accidents, in front of the original query (the keyword “occupational” was placed before the keyword

“accidents” to emphasize occupational). The capability of the synonym search is introduced into the original query by adding a tilde sign in front of every search keyword in the query.

The keywords “occupational accidents” were hard-coded by a human

expert to represent the topic of interest and then automatically positioned in front

of the original search keywords to apply the search engine knowledge (relevant to

Google searches) that the order of the search keywords affects search results. This

canned query component could be made as sophisticated as desired. The tilde sign

is then also added in front of every search keyword in the query to perform a

synonym search. Note that abstractly this second step involves studying the

characteristics of the search engine and writing an algorithm that automatically

makes use of this information to modify the query for the users.
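A minimal sketch of this query transformation, assuming whitespace-delimited keywords (the class and method names are illustrative, not the actual SAOA-1 code):

    public class AlternativeQuery {
        // Builds the Alternative Search query: the canned topic keywords are placed
        // first ("occupational" before "accidents" to emphasize "occupational"), and
        // a tilde is added to every keyword to invoke Google's synonym search.
        static String build(String originalQuery) {
            StringBuilder alt = new StringBuilder("~occupational ~accidents");
            for (String keyword : originalQuery.trim().split("\\s+")) {
                alt.append(" ~").append(keyword);
            }
            return alt.toString();
        }

        public static void main(String[] args) {
            // Prints: ~occupational ~accidents ~hand ~cut
            System.out.println(build("hand cut"));
        }
    }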

Step 3: Applies semantics and knowledge extracted from the OSHA database

The knowledge base developed from a pre-existing relational database

offers the potential to help the information seeker better define the topic of

interest. “Occupational accidents” was the sample domain of interest while

building SAOA-1 using a relational database available to the public (OSHA).

Summaries of occupational accidents investigated by OSHA are available in

OSHA’s Accident Investigation Search database. These summaries contain injury

information such as the controlled vocabulary terms and their associated

categories: “Nature of Injury”, “Part of Body”, “Source of Injury”, “Event Type”,

“Environmental Factor”, and “Human Factor”, administered by OSHA

(Appendix B). This information is extractable and was used to establish conceptual

correlations in semantic classes related to occupational injuries. For example, a

cut injury or a burn injury caused by heat belongs to the concept class “Nature of

Injury”. A hand injury belongs to the concept class “Part of Body”. A machine or

a working surface related injury belongs to the concept class “Source of Injury”.

These conceptual classes and associated concepts are applied to create the

domain-specific knowledge base of SAOA-1. Table 3.6 shows a raw list of

concepts for the semantic class “Source of Injury”. As shown, there are a total of

forty-eight concepts associated with this particular conceptual class, which will be

pruned to become more “meaningful” and “significant” prior to suggesting an

adaptation to the search (broadening or narrowing). This is discussed in the

following sections.

Step 4: Identifies conceptual relationships and focuses attention on related concepts

The OSHA documents created using the Accident Investigation

Summaries Form implicitly indicate semantic relationships among the controlled

vocabulary terms. These indexed documents help identify the semantic

relationships across the conceptual classes and their associated concepts. The

identifiable relationships among these concepts are used to focus attention on

potentially relevant terms to broaden or narrow a query.

Based on the documents of the Accident Investigation Summaries

retrieved from the OSHA database, Table 3.7 uses cut/laceration injuries and

burn/scald injuries to hands as an example to illustrate the conceptual

relationships across semantic classes. Twenty-three concepts from the Source of

Injury List had relationships with cut or laceration injuries to hands and twenty

concepts of Source of Injury had relationships with heat related burn or scald

01 Aircraft                             25 Ladder
02 Air Pressure                         26 Machine
03 Animal/Insect/Bird/Reptile/Fish      27 Materials Handling Equipment
04 Boat                                 28 Metal Products
05 Bodily Motion                        29 Motor Vehicle (Highway)
06 Boiler/Pressure                      30 Motor Vehicle (Industrial)
07 Boxes/Barrels, etc.                  31 Motorcycle
08 Buildings/Structures                 32 Windstorm/Lightning, etc.
09 Chemical Liquids/Vapors              33 Firearm
10 Cleaning Compound                    34 Person
11 Cold (Environmental/Mechanical)      35 Petroleum Products
12 Dirt/Sand/Stone                      36 Pump/Prime Mover
13 Drugs/Alcohol                        37 Radiation
14 Dust/Particles/Chips                 38 Train/Railroad Equipment
15 Electrical Apparatus/Wiring          39 Vegetation
16 Fire/Smoke                           40 Waste Products
17 Food                                 41 Water
18 Furniture/Furnishings                42 Working Surface
19 Gases                                43 Other
20 Glass                                44 Fume
21 Hand Tool (Powered)                  45 Mists
22 Hand Tool (Manual)                   46 Vibration
23 Heat (Environmental/Mechanical)      47 Noise
24 Hoisting Apparatus                   48 Biological Agent

Table 3.6: Concepts associated with the semantic class “Source of Injury” (“Selected Occupational,” 1991) Source: OSHA, http://www.osha.gov/

Part of Body    Nature of Injury     Source of Injury
Hand(s)         Cut/Laceration       Air Pressure
                                     Aircraft
                                     Animal/Insect/Reptile/Fish
                                     Bodily Motion
                                     Boiler/Press Vessel
                                     Buildings/Structures
                                     Electrical Apparatus/Wiring
                                     Fire/Smoke
                                     Gases
                                     Glass
                                     Hand Tool (Manual)
                                     Hand Tool (Powered)
                                     Hoisting Apparatus
                                     Ladder
                                     Machine
                                     Materials Handling
                                     Metal Products
                                     Motor Vehicle (Highway)
                                     Motor Vehicle (Industrial)
                                     Pump/Prime Mover
                                     Train/Railroad Equip
                                     Waste Products
                                     Working Surface
                Burn/Scald (Heat)    Aircraft
                                     Boat
                                     Bodily Motion
                                     Boiler/Press Vessel
                                     Chemical Liquids/Vapors
                                     Cleaning Compound
                                     Dust/Particles/Chips
                                     Electrical Apparatus/Wiring
                                     Fire/Smoke
                                     Gases
                                     Heat (Environmental/Mechanical)
                                     Hoisting Apparatus
                                     Machine
                                     Materials Handling
                                     Metal Products
                                     Motor Vehicle (Highway)
                                     Petroleum Products
                                     Waste Products
                                     Water
                                     Working Surface

Table 3.7: Sample of pruned lists of terms showing concept relationships across semantic classes using hand related injuries

injuries to hands, as organized in Table 3.7. Note that even though there are forty-eight possible concepts for sources of injuries as pre-classified by OSHA as shown in Table 3.6, not all concepts had relationships with cut/laceration injuries to hands or burn/scald injuries to hands. This occurs because there may not be relationships or reasonable correlations among some concepts.

Some concepts had no correlation because the relationships would not be meaningful. For example, the concept Chemical Liquids or Vapors would be related to the concept of Burn injuries but would not be related to the concept of Cut or Laceration injuries, as in Table 3.7. However, it seems odd that the sources of Cut or Laceration injuries to hands involve Fire/Smoke or Gases. Upon further inspection, one of these accidents reported that two employees sustained multiple cuts to their hands from explosion shrapnel as a result of accumulated gasoline fumes in the basement under their work site (“Accident: 14314603,” 1987). The Source of Injury in this accident was reported as Fire/Smoke, as shown in Table 3.8 (“Definzio Imports,” 1987). In such cases, it seems that the “Source of Injury” likely refers to the source of the accident rather than the direct source that contacted the body and physically caused the injury. As a result, the inferences made by SAOA-1 are sometimes erroneous.

The fact that some concepts were not correlated might be because the relationships had never been formed during investigations. For example, Table

3.9 records a leg fracture injury caused by a motorcycle. However, although cut or laceration injuries to hands might be caused by a motorcycle, no such injuries

Inspection Nr           017995044
Investigation Nr        014314603
Line Nr                 7
Age                     31
Sex                     M
Nature of Injury        Cut/Laceration
Part of Body            Hand(S)
Source of Injury        Fire/Smoke
Event Type              Struck By
Environmental Factor    Flying Object Action
Human Factor            Misjudgment, Haz. Situation
Occupation              Occupation not reported
Degree of Injury        Non Hospitalized injury
Task Assigned           Task regularly assigned

Table 3.8: Injury Line: Definzio Imports, Inc. (“Definzio Imports,” 1987) Source: OSHA, http://www.osha.gov/

Inspection Nr           119820454
Investigation Nr        201113511
Line Nr                 1
Age                     30
Sex                     M
Nature of Injury        Fracture
Part of Body            Legs
Source of Injury        Motorcycle
Event Type              Other
Environmental Factor    Other
Human Factor            Other
Occupation              Editors and reporters
Degree of Injury        Hospitalized injury
Task Assigned           Task regularly assigned

Table 3.9: Injury Line: Daisy Publishing Company Inc. (“Daisy Publishing,” 2001) Source: OSHA, http://www.osha.gov/

had been reported in the Accident Investigation Summary at the time the data

were collected.

Step 5: Presents refinements to the user for the “Alternative Search”

By combining the knowledge of Step 3 and Step 4, SAOA-1 automatically

generates a set of context sensitive recommendations for relevant semantic classes

and their associated concepts based on the semantic relationships among the

concepts extracted from the OSHA documents.

To suggest refinements to assist a search, a threshold was established so

that relationships had to be not only meaningful but also “significant.” For

broadening searches, the generated list displays all concepts within each

conceptual class. For narrowing searches, the generated list is pruned using a one

percent threshold to indicate a “significant” relationship among the concepts.

Table 3.10 shows a sample of the pruned list suggesting refinements for

cut/laceration and heat related burn/scald injuries to hand(s). In Table 3.10,

concepts for cut or laceration injuries to hand(s) have been narrowed down from

twenty-three to eleven concepts and concepts for heat related burn or scald

injuries to hand(s) have been narrowed down from twenty concepts to eleven

concepts, both based on the one percent threshold. Note that “Fire/Smoke” is still

listed as a Source of Injury for cut/laceration injuries to hand(s) but “Gases” is not.

This indicates that “Fire/Smoke” had been recorded as the Source of Injury for at

least one percent of the cut/laceration injuries to hand(s) in the Accident

Investigation Summary at the time of data collection.

Part of Body    Nature of Injury     Source of Injury
Hand(s)         Cut/Laceration       Air Pressure
                                     Buildings/Structures
                                     Fire/Smoke
                                     Glass
                                     Hand Tool (Manual)
                                     Hand Tool (Powered)
                                     Hoisting Apparatus
                                     Machine
                                     Materials Handling
                                     Metal Products
                                     Working Surface
                Burn/Scald (Heat)    Chemical Liquids/Vapors
                                     Dust/Particles/Chips
                                     Electrical Apparatus/Wiring
                                     Fire/Smoke
                                     Gases
                                     Heat (Environmental/Mechanical)
                                     Machine
                                     Materials Handling
                                     Metal Products
                                     Petroleum Products
                                     Water

Table 3.10: Sample of a pruned list suggesting refinements for hand related injuries
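The pruning rule itself is simple to state: a concept is kept for the “Narrow by” list only if it accounts for at least one percent of the records matching the currently selected concepts. A minimal sketch under the assumption that co-occurrence counts have already been tallied from the extracted OSHA records (the data structures and names are illustrative):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RefinementPruner {
        // Keeps only the concepts whose share of the matching records meets the
        // one percent threshold used to define a "significant" relationship.
        static List<String> prune(Map<String, Integer> conceptCounts, int totalMatchingRecords) {
            final double threshold = 0.01;
            List<String> significant = new ArrayList<String>();
            for (Map.Entry<String, Integer> entry : conceptCounts.entrySet()) {
                if ((double) entry.getValue() / totalMatchingRecords >= threshold) {
                    significant.add(entry.getKey());
                }
            }
            return significant;
        }

        public static void main(String[] args) {
            // Hypothetical counts for cut/laceration injuries to hand(s):
            // "Fire/Smoke" clears the one percent threshold; "Gases" does not.
            Map<String, Integer> counts = new HashMap<String, Integer>();
            counts.put("Fire/Smoke", 15);
            counts.put("Gases", 3);
            System.out.println(prune(counts, 1000)); // [Fire/Smoke]
        }
    }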

By selecting concepts from the list of suggested refinements, the

information seeker can modify the Alternative query by broadening and/or

narrowing the search. The list and its display are automatically generated by

SAOA-1. Concepts having semantic relationships with keywords in the original

queries are retrieved from each of the three selected conceptual classes, i.e.,

“Nature of Injury”, “Part of Body”, and “Source of Injury”, from the established

knowledge base. The hierarchical display of the list uses a tree structure for

content navigation. The conceptual classes are displayed under the branches

“Broaden” and/or “Narrow by” depending on which conceptual class contains the keywords of the original query. The concepts are then displayed under the branch of the associated conceptual class. For example, if the original query contains

“hand” and “cut”, then the conceptual classes Part of Body and Nature of Injury will be listed under the branch “Broaden” to allow users to broaden the search by selecting other concepts such as arms in “Part of Body” and/or burn in “Nature of

Injury”, and so forth. Meanwhile, the conceptual class “Source of Injury” will be listed under the branch “Narrow by” to allow users to narrow the search by selecting concepts such as machines in “Source of Injury”. An illustration of this example can also be found in the left-hand section of Figure

3.10. When broadening, the displayed list includes all concepts in the conceptual class for the selection. When narrowing, the displayed list is pruned by only displaying concepts with significant relationships.

For display purposes, a plus sign (“+”) in front of each branch allows users to click to expand the branch. Similarly, clicking a minus sign (“–”) allows users to collapse the desired branch. A checkbox is displayed in front of each concept to allow the user to select or unselect concepts to form a modified query.

When forming a modified query, concepts across conceptual classes are combined with the AND operator and concepts within the same conceptual class are combined with the OR operator (Figure 3.11). The capitalized AND and OR are common search operators and can also be used in Google (search engine knowledge). The operator AND does not need to be typed in with the search keywords since it is Google’s default. The order of the keywords in the

modified query is also adjusted based on the order of the related keywords as they

appear in the original query to incorporate search engine keyword order

knowledge. For example, without selecting any additional concepts, if the original

search query is “hand cut”, the modified query will show “~occupational

~accidents ( ~hands ) ( ~cuts OR ~lacerations )” (Figure 3.11). However, if the

original search query is “cut hand”, the modified query will show

“~occupational ~accidents ( ~cuts OR ~lacerations ) ( ~hands )” as illustrated in

Figure 3.17.
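A minimal sketch of this query-assembly rule (OR within a conceptual class; juxtaposition, which Google treats as AND by default, across classes); the class and method names are illustrative:

    import java.util.Arrays;
    import java.util.List;

    public class ModifiedQuery {
        // Each inner list holds the tilde-prefixed concepts selected from one
        // conceptual class, ordered to match the related keywords in the original
        // query. Concepts within a class are ORed; the parenthesized groups are
        // juxtaposed, which Google combines with AND by default.
        static String build(List<List<String>> selectedByClass) {
            StringBuilder q = new StringBuilder("~occupational ~accidents");
            for (List<String> classConcepts : selectedByClass) {
                q.append(" ( ").append(String.join(" OR ", classConcepts)).append(" )");
            }
            return q.toString();
        }

        public static void main(String[] args) {
            // Original query "hand cut" yields:
            // ~occupational ~accidents ( ~hands ) ( ~cuts OR ~lacerations )
            System.out.println(build(Arrays.asList(
                    Arrays.asList("~hands"),
                    Arrays.asList("~cuts", "~lacerations"))));
        }
    }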

3.4 Implementation Methods

There are three main components in SAOA-1: a knowledge base, a collection of algorithms, and an interface. The knowledge base contains the conceptual classes and associated concepts (i.e., the controlled vocabulary terms) and the semantic relationships among these concepts as extracted from documents in OSHA’s Accident Investigation

Search database. The algorithms carry out the major tasks: triggering SAOA-1, modifying the query to focus on occupational accidents (including synonyms), generating suggestions for broadening and narrowing the search, and showing the final display. The interface displays the final results and provides functions allowing user interaction.

3.4.1 Knowledge Base Preparation

Three conceptual classes: “Part of Body”, “Nature of Injury”, and “Source of

Injury” were used as our knowledge base for this study. The method used to extract and convert the knowledge from OSHA’s database to that of SAOA-1 is described as follows.


Figure 3.17: A screenshot of the dialog box displaying the new query for Modified Search (“cut hand”)

The contents of the “Injury Line” documents were automatically retrieved and extracted from OSHA’s database using a macro written in Microsoft Visual Basic (Version 6) in

Microsoft Excel spreadsheets (Version 2003) in December 2006. A total of 64,702 “Injury Line” documents were retrieved, of which 61,021 were valid and without errors (i.e., excluding documents that returned Error 404 – File Not Found at the time of retrieval). From these 61,021 valid documents, records containing only the conceptual classes “Nature of Injury”, “Part of

Body”, and “Source of Injury” were extracted. In order to

construct a more meaningful knowledge base, concepts such as “Body System”,

“Multiple”, and “Other Body System” from the conceptual class “Part of Body”, “Person” from the conceptual class “Source of Injury”, and “Other” from the conceptual classes “Nature of Injury” and “Source of Injury” were removed. In a real system, these would either have to be removed manually or left in as “noise” for the user to deal with. A total of 31,399 records were ultimately used to compose the knowledge base of SAOA-1.

While maintaining the semantic relationships, concepts in each of the three conceptual classes were stored as a text file accessible by SAOA-1.
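As an illustration only (the actual file layout used by SAOA-1 is not reproduced here), such a text file might hold one extracted record per line with tab-separated fields, loaded along these lines:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class KnowledgeBaseLoader {
        // Assumed layout, one record per line:
        // natureOfInjury <TAB> partOfBody <TAB> sourceOfInjury
        static List<String[]> load(String path) throws IOException {
            List<String[]> records = new ArrayList<String[]>();
            BufferedReader in = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split("\t");
                    if (fields.length == 3) { // skip malformed lines
                        records.add(fields);
                    }
                }
            } finally {
                in.close();
            }
            return records;
        }
    }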

3.4.2 Software Development

3.4.2.1 Tools

The program uses three different computer languages including Hypertext

Markup Language (HTML, http://www.w3.org/TR/REC-html40/), JavaScript

(http://developer.mozilla.org/en/docs/About_JavaScript/), and Java (http://java.sun.com/).

The networking and computation are coded using Java. The interface is written in HTML and the Web functions (i.e., “Submit” buttons) are written in JavaScript. All the source code is written using NetBeans Integrated Development Environment (IDE) (Version

5.5.1, http://www.netbeans.org/) by the author. NetBeans is a free, open-source software development tool for creating Java applications. Additional tools used include

Microsoft FrontPage (Version 2003) and Mozilla Firefox (Version 2.0.0.6). They were used to test and debug the HTML and JavaScript code. The application was tested on

Microsoft Windows Internet Explorer (IE) versions 6 and 7. Google, due to its leading search engine status, was chosen as the search engine to retrieve information from the

Internet. SAOA-1 retrieves documents by sending search queries to Google.

3.4.2.2 Software Development

“New”, “Original”, “Alternative”, and “Modified” are the four types of searches associated with SAOA-1. The algorithms conduct the following major tasks prior to generating the final display:

a. Query triggering and modification,

b. Use of the knowledge base to suggest ways to broaden or narrow the search, and

c. Document retrieval

a. Query triggering and modification

The process starts when a New Search is submitted: the search keywords in the query of the Original Search are reviewed and examined to determine whether they are related to the topic of interest (occupational accidents). In other words, the

“Original Search” is related to the topic of interest if any keyword in the input query matches the trigger terms that were constructed for SAOA-1.

b. Use of knowledge base to suggest ways to broaden or narrow the search

If the “Original Search” triggers SAOA-1, the agent constructs the “Alternative

Search” and “Suggested Refinements for Alternative Search” by executing the five steps described earlier in an attempt to assist the user in retrieving more relevant results. If the

“Original Search” is not related to the topic of interest, SAOA-1 is not activated.

However, SAOA-1 will still run the search, retrieve documents, and generate the final results for display.

When the search process repeats, SAOA-1 will identify whether the search is a

“New Search,” an “Original Search,” an “Alternative Search,” or a “Modified Search” and run the appropriate query accordingly.

c. Document retrieval

The first step in retrieving a document is to establish an Internet connection with

Google. SAOA-1 communicates with Google based on Google’s Uniform Resource

Locator (URL) (i.e., http://www.google.com/). SAOA-1 submits queries to Google by manipulating Google’s URL. For example, if the query is “hand cut”, SAOA-1 will construct a URL to include the query as in the form of http://www.google.com/search?q=hand+cut. In addition to submitting the query, parameters can be set to manipulate Google’s retrieval behavior (“Advanced Operators,”

2008; De Valk, 2007). For example, “num” specifies the number of documents displayed per page and “start” indicates which documents are displayed first. The URL http://www.google.com/search?q=hand+cut&num=100&start=0 returns the results for the query “hand cut”, displays 100 documents per page, and starts with the first document.

The URL http://www.google.com/search?q=hand+cut&num=30&start=4 returns the results for the query “hand cut”, displays 30 documents per page, and starts with the fifth document. Note that the numbering of the first document starts with 0 instead of 1, and the document display order (1st document, 2nd document, etc.) is based on Google’s ranking algorithm.

The query-embedded URL is executed through a command line using Java but acts as if it were sent directly from an IE browser, by setting the browser’s corresponding “User-Agent” string in the Java program (Hardmeier, 2006; tony.thompson, 2002;

“Understanding User,” 2008). For the purpose of this research, only the first one hundred documents from Google are retrieved. After retrieving the information from Google, the contents are analyzed to identify the actual documents needed. The sponsor links, advertisements, and other additional information such as Google’s styles, links and images, if any, are all removed before the final display.
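A minimal sketch of this retrieval step (the User-Agent string shown is an illustrative IE value, not necessarily the one SAOA-1 used):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class GoogleFetcher {
        public static void main(String[] args) throws Exception {
            // Build the query-embedded URL: "hand cut" becomes q=hand+cut;
            // num=100 documents per page, start=0 for the first document.
            String query = URLEncoder.encode("hand cut", "UTF-8");
            URL url = new URL("http://www.google.com/search?q=" + query
                    + "&num=100&start=0");

            // Present the request as if it came from an IE browser by setting
            // the "User-Agent" request header.
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");

            // Read the raw HTML; in SAOA-1 this content would then be parsed to
            // strip sponsored links, styles, links, and images before display.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }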

3.4.2.3 The Interface

The interface displays the final results and allows user interaction. The final display of the interface is composed of top, left, right, and bottom sections. The top section contains the title, a text field to type in queries for “New Search,” and a “Search” button for submitting the search. The left section displays information and knowledge associated with the current search session (if the search is related to the topic of interest).

The right section displays the final output documents retrieved from Google. The bottom section contains links to access additional pages.

3.5 Functionalities of the Agents

The agent SAOA-1 has a variant called SAOA-0 (working title: Agent Phil).

SAOA-0 is a simplified version of SAOA-1. SAOA-1 was used for the treatment group while SAOA-0 was used for the control group in the empirical study.

3.5.1 Functionalities of SAOA-1

The interface for SAOA-1 is a set of Web pages in the HTML format. It utilizes two interface screen designs: the homepage and the results page. For illustration purposes, the figures show the screen images for the first ten results only.

As briefly described in Section 3.2, the homepage (Figure 3.4) is a standard

HTML page containing a text box for keyword entry, also called a text field, and a

“Search” button next to the text box to submit the search query. Users submit the search query by clicking the “Search” button using the mouse.

After submitting the search query, SAOA-1 retrieves results from Google and displays the retrieved documents in a newly generated HTML page: the results page. The results page is composed of top, left, right, and bottom sections.

The top section contains the text box and the Search button as on the homepage.

The text box displays the most recently run query to allow the user to append or modify the query for a “New Search”. The user may also simply enter new keywords to start a

“New Search”.

The left section contains the title “Search Query” and the following three segments: “Original Search,” “Alternative Search,” and “Suggested Refinements for the

Alternative Search.”

The “Original Search” displays the information on the original search query and the number of retrievals provided in Google’s outcomes. It keeps a record of what the user’s original query was to reduce the user’s memory load. The “Run Original Search”

button allows the user to revisit the search results of the “Original Search” without typing in the original search keywords again (Figure 3.6).

The “Alternative Search” is generated if SAOA-1 detects that the “Original

Search” query is correlated to keywords programmed into SAOA-1 (occupational accidents). The “Alternative Search” segment displays the “Alternative Search” query and the number of retrievals gathered by the “Alternative Search”. The “Run Alternative Search” button searches using this alternative query and displays the results in the right section with the label “Alternative Search” at the top when the results page is generated. The title

“Alternative Search” is highlighted. The “Alternative Search” query “~occupational ~accidents ~hand ~cut” is displayed along with the number of documents retrieved by Google, 2,260,000 (Figure 3.8).

The domain-specific knowledge is presented as the “Suggested Refinements for

Alternative Search” used for broadening and/or narrowing the search. This includes the three conceptual classes: Nature of Injury, Part of Body, and Source of Injury and the associated concepts of each conceptual class from the knowledge base. The list of

“Suggested Refinements for Alternative Search” is displayed using a tree structure. Users are able to explore the conceptual classes and associated concepts by expanding or collapsing the branches. Users may refine the “Alternative Search” by selecting or unselecting suggested keywords to broaden or narrow their search. Note that the list of concepts under the “Narrow by” branch is pruned and only concepts with “significant” relationships are presented. In other words, only the concepts whose relationships with both the “hand(s)” and “cut/laceration” concepts pass the one percent threshold in the knowledge base are displayed (see Figure 3.10).

By clicking the “Show Modified Search” button, the modified query based on the refinements will be displayed in a pop-up dialog box showing the new query for the

“Modified Search”. The new query is “~occupational ~accidents ( ~hands ) ( ~cuts OR

~lacerations )” in the example (Figure 3.11).

Clicking the “Run Modified Search” button will generate a new set of search results based on the modified query the users have selected. It runs the query displayed by clicking the “Show Modified Search” button. The newly generated results page based on the “Modified Search” will display the new modified query on top of the “Run

Modified Search” button and the number of retrievals of this “Modified Search.” The title

“Modified Search” is highlighted. The query of the adjusted search, “~occupational ~accidents ( ~hands ) ( ~cuts OR ~lacerations )”, is displayed along with the number of documents retrieved by Google, 1,960,000 (Figure 3.13).

The right section contains the label “Search Results” and displays the results retrieved from Google. Information on the retrieved documents is displayed in this right hand section on each results page. If results are generated by the “Run Original Search” button, the label of this right section will show “Original Search” in highlighted form. If the results are generated by the “Run Alternative Search” button or the “Run Modified

Search” button, the label will show highlighted “Alternative Search” or “Modified

Search,” respectively.

The information on each document includes a headline with a hyperlink to the actual document, a brief description of the document contents, and links to “Cached”, “Similar pages”, or “More results ...” provided by Google at the bottom of the description (if available). In addition, an index number is added for each document. This feature is not found in Google’s display.

The bottom section provides links to additional retrieved documents. For the purpose of this research, only the first one hundred documents from Google are retrieved.

The first results page displays the search information and the first forty documents. The second results page displays the search information and the next forty documents and can be accessed from the link “2” in the bottom section. The third results page displays the search information and the rest of the retrieved documents (documents eighty-one to one hundred), and can be accessed using the link “3” in the bottom section. For illustrative purposes, the figures (refer to Figures 3.5, 3.7, and 3.12) all display ten documents per page, so ten links (1 to 10) appear at the bottom of the page to access additional pages.

3.5.2 Functionalities of SAOA-0

SAOA-0 was used as the control condition for the evaluation of SAOA-1. SAOA-0 functions like SAOA-1 but does not apply knowledge to enhance the search. SAOA-0 acquires search keywords but does not examine whether the search keywords are related to the topic of interest as SAOA-1 does. It then reforms the query, composes the query-embedded URL, retrieves documents, and generates the final results for display.

As with SAOA-1, SAOA-0 utilizes the homepage and results page screen styles.

The homepage of SAOA-0 has the same display as SAOA-1 except for the title. Like

SAOA-1, the results page of SAOA-0 is also composed of four sections (top, left, right, and bottom). Because SAOA-0 does not provide knowledge to assist with a search, the

left section of SAOA-0 only displays the information related to the original search

(original input search query and number of retrievals). Regarding the retrieved documents, those retrieved by SAOA-0 would be the same as the documents retrieved using the

Original Search from SAOA-1 if retrieved at the same time. A screen image of results from SAOA-0 using the search keywords “hand” and “cut” is displayed as Figure 3.18.

Figure 3.19 shows a partially zoomed-in version of Figure 3.18 for readability. For illustrative purposes, only the first ten results are shown on each page.


Figure 3.18: A screen image of the first ten results from SAOA-0 (Retrieved March 1, 2008)


Figure 3.19: A screen image of the partially zoomed in version of Figure 3.18 for readability (SAOA-0)


CHAPTER 4

EMPIRICAL EVALUATION

4.1 Introduction

As discussed in Chapter 3, SAOA-1, a prototype of a topic-oriented search agent, was developed with embedded semantics and domain-specific knowledge extracted from a pre-existing database that was not intended to support Internet search. The brief example showed that both the Alternative Search and the Modified Search retrieved more relevant information (pertaining to occupational accidents) than the Original Search (Google

Search) did for the first ten retrievals. An empirical study was conducted to compare the first forty retrievals from SAOA-1 and those from SAOA-0 (Google). The goal of the empirical study was to evaluate how useful SAOA-1 is in assisting information seekers to retrieve relevant documents.

The rest of the chapter is organized as follows. Section 4.2 discusses the methods for conducting the empirical study, including the participants, the questionnaire, the procedure, and the analyses of data. The results of the empirical study are presented in

Section 4.3.

4.2 Methods

4.2.1 Participants

Each of the sixty participants was randomly assigned to one of two conditions:

Treatment 1 (SAOA-1) or Treatment 2 (SAOA-0). Thirty participants were exposed to

Treatment 1 only as the treatment group. The other thirty participants were exposed to

Treatment 2 only as the control group. All participants were adults eighteen years of age or older.

4.2.2 Questionnaire

The questionnaire contains four sections. Section I gathers general demographic information including the participant’s gender, age, education, and online usage history.

Specifically, the online portion collects individual data regarding Internet familiarity

(how long the participant has been using the Internet), Internet access location, Internet search frequency, and search engine preferences.

Section II demonstrates the search agents used by the participants in the study, introducing the functions and displays of each search agent as described in Chapter 3. The participants in the treatment group were given a demonstration of SAOA-1 and the participants in the control group were given a demonstration of SAOA-0.

Section III informs and instructs the participants regarding the tasks required for the study, and provides an evaluation sheet for the participants to gauge the relevance of information garnered from their final search. The participants were instructed to perform a series of tasks in order to research information for a college term paper dealing with

occupational accidents. After reading two articles addressing hand injuries in the workplace, each participant had to decide upon a topic for his or her term paper. The first article describes an unusual penetrating injury to the hand caused by a plastic injection molding machine (Aiyer, Vij, & Rana, 1995). The second article reports the increasing number of hand injuries among kitchen workers and the high cost to the foodservice industry (Prewitt, 2001). Using the search agent provided, the participants executed online searches based on their topics using keywords they believed would generate the most relevant results. The participants could start a new query or refine an existing query until satisfied with the search results. Research was terminated when the participant felt the information collected was sufficient to write the term paper. The last step required participants to evaluate the relevance of the first forty online documents of the final search to their research topics. The online documents were rated from highly irrelevant to highly relevant on a Likert scale (i.e., a measurement of preference with a numerical scale for responses) ranging from one (highly irrelevant) to five (highly relevant), provided on the evaluation sheet.

Section IV posed eight questions to the treatment group, investigating how useful each participant considered the interface. The first four questions asked how strongly the participant agreed or disagreed that SAOA-1: (1) was easy to use, (2) helped to better refine the searches, (3) helped the participant to better understand what he or she was retrieving, and (4) helped the participant succeed in finding the desired information. Again, each participant indicated his or her satisfaction with SAOA-1 using the Likert scale. The next four questions were open-ended questions asking the participants: (1) to list the most negative aspects, (2) to list the most positive aspects, (3) to describe what aspects were especially important, and (4) to provide additional comments, if any. The control group received only two questions, which were also posed to the treatment group as questions 2 and 4. These questions asked how strongly the participant agreed or disagreed with the usefulness of SAOA-0. The full questionnaire, Sections I to IV, including the two articles, a consent form, and a receipt form, is included in Appendix C.

4.2.3 Procedure

Upon arrival, each participant signed a consent form indicating his or her willingness to participate in the study and filled out the information requested in Section I of the questionnaire (demographics and online background information). In order to familiarize themselves with the scope and process of the study, the participants previewed the search process and viewed a demonstration of the search agent to be used to perform the tasks. This demonstration included starting a new search in the text box, browsing result pages, reviewing the functions and information provided by each button, checking links to additional result pages, etc.

Upon reviewing Section III and reading the two articles provided, each participant decided on a term paper topic related to occupational accidents and began searching online using the provided search agent in order to gather enough relevant information to write the paper. Participants were informed that they could ask any questions at any time.

If participants needed a better set of retrievals to find the desired information, they could modify searches or start new searches as needed. Participants were also informed that observations and notes would be made while they were performing search tasks. The notes included but were not limited to queries the participants used, search agent

functions/buttons used, and page navigation (scrolling down the pages, clicking on links, going beyond the first page, etc.).

After participants finished researching, they were asked to rate the first forty documents listed on the first page of their final search, or the documents retrieved from their most satisfactory search. Each document was rated for relevancy on the following five-point scale:

1 = Document highly irrelevant

2 = Document irrelevant

3 = Document neutral

4 = Document relevant

5 = Document highly relevant

The participant then completed Section IV, which evaluated the usability of the interface, and provided additional comments (if any). For each group, participants were also asked to provide opinions regarding the most negative, positive, and important aspects of the search agent. Finally, participants signed a receipt of acknowledgment for their $20 cash incentive.

4.2.4 Analyses of Data

All data were entered into Microsoft Excel (Version 2003) spreadsheets after the study. The statistical software applications Stata (Standard Edition, Version 9.2), Minitab (Version 15.1.1.0), and SPSS (Version 16.0.1) were used to perform statistical analyses and cross-verify the results. An alpha level of 0.05 was used for all statistical tests.

The demographic information was summarized using descriptive statistics to show the characteristics of the participants and the usage of the Internet and Internet searches.

The key questions for this study are whether the treatment type had an overall effect on expected rating and whether the document number had an overall effect on expected rating. In the study, each of the sixty participants rated the first forty documents retrieved by the search engine on a Likert scale from one to five.

There were two dependencies in these data. First, each participant produced forty observations, one corresponding to each document number. Second, these forty observations were not independent of document order: in the search results, the order of the documents is expected to reflect how relevant each document would be in relation to the search query, based on relevance ranking by Google. Therefore, document 1 would be expected to have the highest rating for participants in both groups, document 2 the next highest rating for participants in both groups, and so on, because Google has applied a relevance ranking algorithm to the outcomes.

In addition to gathering information about the control group and the treatment group in general, we are also interested in any statistically significant difference between the expected ratings of the control group and

a. the group of participants whose final results were produced using the Alternative

Search

b. the group of participants whose final results were produced using the Modified

Search

There are four executable buttons provided by SAOA-1 to assist in the search.

They are the “Run Original Search”, “Run Alternative Search”, “Show Modified Search”, and “Run Modified Search” buttons. Ideally, we would like all the treatment group participants to take advantage of the four executable buttons provided by SAOA-1 (if

SAOA-1 was activated). However, participants in the treatment group might not click on any of the four executable buttons provided by SAOA-1 to assist in their search, even if the participants were exposed to the treatment (SAOA-1). For example, the participant might have seen the information on the Original Search, Alternative Search, Suggested

Refinements for Alternative Search, and Modified Search provided by SAOA-1, but might not click on any of the four executable buttons: “Run Original Search”, “Run

Alternative Search”, “Show Modified Search”, and “Run Modified Search”. With this in mind, we would like to know if there is any statistically significant difference between the expected ratings of the control group and the group of participants who clicked on any of the four executable buttons provided by SAOA-1 to submit searches. Here we assume that the participants who clicked on any of the four buttons described above during the search process obtained more domain-specific knowledge provided by SAOA-1 than participants in the treatment group who did not click on any of the four buttons, regardless of whether or not the final results were produced using the Alternative Search or the Modified Search.

Therefore, the following research hypotheses may be tested:

1. The treatment type has an overall effect on expected rating. In other words, the expected rating in the treatment group is significantly higher than the expected rating in the control group.

2. The document number has an overall effect on expected rating. In other words, document number 1 has the highest expected rating for participants in both groups; document number 2 has the next highest expected rating for participants in both groups, etc.

3. The expected rating from the group of participants whose final results were produced by using the Alternative Search is higher than the expected rating of the control group.

4. The expected rating of the participants whose final results were produced by using the

Modified Search is higher than the expected rating from the control group.

5. The expected rating of the group who clicked on any of the four executable buttons provided by SAOA-1 to assist with the search is higher than the expected rating from the control group.

In addition to the expected rating for level of relevancy, we studied whether

SAOA-1 had an overall influence on search behaviors. The number of searches submitted by each participant was used to test whether the treatment group had a higher number of searches than the control group. Search patterns are illustrated with a few examples, along with evaluation results regarding interface usability and answers to the open-ended questions (listed in Appendix E).

4.3 Results

The results include general demographic information, research findings on the test results pertaining to previously described hypotheses, search behaviors, and the evaluation of interface usability.

4.3.1 Demographic Information

There were 60 participants involved in this study. Each participant was randomly assigned to either the treatment group or the control group. Each group included 30 participants. The general demographic information addresses gender, age, education, and Internet usage and familiarity (usage history, access locations, search frequency, and search engine preferences).

a. Gender:

There were 26 females (43%) and 34 males (57%) who participated (Table 4.1,

Figure 4.1). Among the 26 female participants, 17 were in the treatment group and 9 were in the control group. Among the 34 male participants, 13 were in the treatment group and

21 were in the control group.

Gender    Treatment Group    Control Group    Total    Percentage (%)    Cumulative Percentage (%)
Female    17                 9                26       43.3              43.3
Male      13                 21               34       56.7              100.0
Total     30                 30               60       100.0

Table 4.1: Gender distribution of participants


Figure 4.1: Gender distribution of participants (from Excel)

b. Age:

The average participant age was 26.23 years in the treatment group and 27.33 years in the control group. Ages of the entire participant population (60 people) ranged from 20 to 46. Twenty-seven participants (45%) were between the ages of 20 and 24. Seventeen participants (28.3%) were between the ages of 25 and 29, and 9 participants (15%) were between the ages of 30 and 34. There were 7 participants (11.7%) between the ages of 35 and 46 (Table 4.2, Figure 4.2).

Age        Treatment Group    Control Group    Total    Percentage (%)    Cumulative Percentage (%)
20 ~ 24    14                 13               27       45.0              45.0
25 ~ 29    9                  8                17       28.3              73.3
30 ~ 34    4                  5                9        15.0              88.3
35 ~ 39    2                  2                4        6.7               95.0
40 ~ 46    1                  2                3        5.0               100.0
Total      30                 30               60       100.0

Table 4.2: Age distribution of participants


Figure 4.2: Age distribution of participants (from Excel)

c. Education

46 participants (76.7%) were college students and 14 participants (23.3%) were not college students at the time of the study. In both the treatment and the control groups,

23 participants were college students and 7 were not college students at the time of the study (Table 4.3, Figure 4.3).

College Student    Treatment Group    Control Group    Total    Percentage (%)    Cumulative Percentage (%)
Yes                23                 23               46       76.7              76.7
No                 7                  7                14       23.3              100.0
Total              30                 30               60       100.0

Table 4.3: Number of college students among the participants


Figure 4.3: Number of college students in each group (from Excel)

d. Highest level of education completed

Among those 14 participants who were not college students at the time of the study, 9 participants had a bachelor’s degree, 3 participants had a master’s degree, and 2 participants held doctoral degrees. The participants who were college students included not only undergraduate students but also students in medical school, master’s programs, and Ph.D. programs. Table 4.4 shows the highest level of education completed by the participants.

Figure 4.4 shows the distribution of degrees held by the participants.

Education            Treatment Group    Control Group    Total    Percentage (%)    Cumulative Percentage (%)
Some College         10                 13               23       38.3              38.3
Bachelor’s Degree    14                 10               24       40.0              78.3
Master’s Degree      5                  6                11       18.3              96.7
Doctoral Degree      1                  1                2        3.3               100.0
Total                30                 30               60       100.0

Table 4.4: Highest level of education completed by the participants


Figure 4.4: Highest level of education completed by participants (from Excel)

e. Internet usage history

All participants had been using the Internet for at least 4 years. 7 participants (11.7%) had been using the Internet for between 4 and 6 years. 27 participants

(45.0%) had been using the Internet for between 7 and 10 years. 26 participants (43.3%) had been using the Internet for 10 years or more. Table 4.5 and Figure 4.5 show Internet usage history among the participants.

Years               Treatment Group    Control Group    Total    Percentage (%)    Cumulative Percentage (%)
Less than 1 year    0                  0                0        0.0               0.0
1 to 3 years        0                  0                0        0.0               0.0
4 to 6 years        2                  5                7        11.7              11.7
7 to 10 years       11                 16               27       45.0              56.7
10 years or more    17                 9                26       43.3              100.0
Total               30                 30               60       100.0

Table 4.5: Internet usage history among participants


Figure 4.5: Internet usage history among participants (from Excel)

f. Internet access locations

All participants responded about locations they frequent to access the Internet. 7 participants (11.7%) said they accessed the Internet from only one place. 53 participants

(88.3%) accessed the Internet from more than one location. 25 participants (41.7%) used three different Internet access locations, followed by 13 participants (21.7%) who used four separate locations. 10 participants (16.7%) logged on to the Internet from two different locations. Only 5 participants (8.3%) used the Internet from five or more different locations (Table 4.6, Figure 4.6).

Different Locations    Treatment Group    Control Group    Total    Percentage (%)    Cumulative Percentage (%)
1                      3                  4                7        11.7              11.7
2                      2                  8                10       16.7              28.3
3                      14                 11               25       41.7              70.0
4                      8                  5                13       21.7              91.7
5 or more              3                  2                5        8.3               100.0
Total                  30                 30               60       100.0

Table 4.6: Number of locations used to access the Internet


Figure 4.6: Number of different locations to access the Internet (from Excel)

Internet access locations included home and home office, school, work, public terminals such as the library, and other places. Unsurprisingly, home was the number one location for accessing the Internet. The tallied results showed that 57 participants (95.0% of the 60 participants) went online from home. 48 participants (80.0% of the 60 participants) accessed the Internet from school. 37 participants (61.7% of the 60 participants) secured Internet access from work. 31 participants (51.7% of the 60 participants) accessed the Internet from public terminals. Finally, 6 participants (10% of the 60 participants) accessed the Internet from locations other than home, school, work, or public terminals. Note that each percentage is taken over all 60 participants, so the percentages do not sum to 100% (Table 4.7, Figure 4.7). Also note that 46 participants indicated that they were students at the time of the study (see

Table 4.3). However, there were 48 participants who indicated that they had access to the

Internet from school (See Table 4.7). Upon further investigation, two participants (024 and 033) in the treatment group both indicated that they were not students but took advantage of school-based Internet hubs.

Internet Access Locations    Treatment Group    Control Group    Total    Percentage* (%)
Home                         28                 29               57       95.0
School                       25                 23               48       80.0
Work                         21                 16               37       61.7
Public terminal              19                 12               31       51.7
Other                        3                  3                6        10.0
Total                        96                 83               179

Table 4.7: Internet access locations *Percentage of 60 participants. Sum does not add up to 100%.


Figure 4.7: Internet access locations (from Excel)

g. Internet search frequency

When questioned about how often they searched online, all 60 participants reported searching the Internet at least once a week. The majority, 55 participants (91.7%), responded that they conducted Internet searches on a daily basis. The remainder, 5 participants (8.3%), indicated searching the Internet on a weekly basis.

Search                   Treatment   Control           Percentage   Cumulative
Frequency                Group       Group     Total   (%)          Percentage (%)
Daily                    29          26        55      91.7         91.7
Weekly                   1           4         5       8.3          100.0
Monthly                  0           0         0       0.0          100.0
Less than once a month   0           0         0       0.0          100.0
Never                    0           0         0       0.0          100.0
Total                    30          30        60      100.0

Table 4.8: Internet search frequency


Figure 4.8: Internet search frequency (from Excel)

h. Internet search engine preferences

All 60 participants responded regarding their Internet search engine preferences and usage frequency. Each participant indicated how frequently he or she used a particular search engine relative to others (four popular commercial search engines plus any other engines the participant preferred). Table 4.9 shows the relative usage of each search engine per group and overall. The four most preferred commercial search engines were Google (http://www.google.com/), Yahoo (http://search.yahoo.com/), Ask (http://www.ask.com/), and Live Search (http://www.live.com/). In addition, four alternative search engines were used by four of the participants: answers.com, baidu.com, AOL.com, and Wikipedia.org. Google was the most popular search engine in both the treatment and control groups (Figure 4.9). Based on the data collected, Google dominated the search engine preferences with an 86.3% share, followed by Ask with 9.2%. Live Search and Yahoo accounted for 2.3% and 1.2%, respectively. The overall usage of the four search engines in the “Other” category accounted for 1.0% (Figure 4.10).

                  Usage (%)
Search Engine     Treatment Group   Control Group   Overall
Google            89.0              83.7            86.3
Yahoo             0.3               2.0             1.2
Ask               8.7               9.7             9.2
Live Search       2.0               2.7             2.3
Other             0.0               2.0             1.0
Total             100.0             100.0           100.0

Table 4.9: Group and overall Internet search engine preferences


Figure 4.9: Internet search engine preferences by group (from Excel)


Figure 4.10: Overall Internet search engine preferences (from Excel)

4.3.2 Research Findings

A regression model with dummy variables to represent each participant (with repeated measurements per participant) was constructed to analyze the data. This regression model used indicator variables (Polissar & Diehr, 1982; Suits, 1984) to represent each of the sixty participants. It also contained an independent variable representing document order (as each participant viewed and rated forty documents that had been ranked for relevance by Google) and document order squared. (The combination of a linear and a squared term for document order fit better than either variable alone.) The dependent variable was the rating of a given document for relevance by a given participant (on a scale from 1 to 5). This regression model was used to generate statistics to test the relevant hypotheses.
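To make the model concrete, a minimal sketch of this regression is shown below, assuming the ratings are available in long format (one row per participant-document pair). The file name and column names are hypothetical, and the dissertation's analysis itself was produced with Minitab:

```python
# Hypothetical sketch of the dummy-variable regression described above.
import pandas as pd
import statsmodels.formula.api as smf

ratings = pd.read_csv("ratings.csv")  # columns: participant, doc_order, rating

# C(participant) creates one indicator (dummy) variable per participant;
# doc_order and its square capture the decline in rating with Google's rank.
model = smf.ols("rating ~ C(participant) + doc_order + I(doc_order ** 2)",
                data=ratings).fit()
print(model.rsquared, model.rsquared_adj)  # compare with 87.8% / 87.42%
```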

4.3.2.1 Control Group vs. Treatment Group

The first hypothesis tested was that there was no difference in expected rating across participants in the control group vs. the treatment group. The regression model used as the basis for testing this hypothesis therefore was based on all 60 participants, with 40 observations per participant. Using an F statistic, the result suggested that there was a significant difference in expected rating between the control group and the treatment group; the null hypothesis of no difference was rejected at p < 0.001. The average effect (coefficient) for participants in the control group was β = 3.68 (range from 2.52 to 5.50); for the treatment group, it was β = 3.97 (range from 2.55 to 5.55). The R-square value indicated that the predictors explained 87.8% of the variance in the expected rating. The adjusted R-square value, which accounts for the number of predictors in the model, was 87.42%.
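For reference, the adjusted R-square follows the standard correction for the number of predictors p and the number of observations n (here n = 60 × 40 = 2,400, with one indicator per participant plus the two document-order terms):

$$\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}$$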

Figure 4.11 shows a histogram depicting the average ratings for both groups. The average rating in the control group was 2.88 with a standard deviation of 0.72 and a range from 1.725 to 4.700. The average rating in the treatment group was 3.17 with a standard deviation of 0.85 and a range from 1.750 to 4.750.

Figure 4.12 shows the interaction plot for the average rating score of each of the forty documents by group. Although there appears to be a trend toward a greater between-group difference at lower ratings, this interaction effect between group and document is not statistically significant.

In this model, both document number (p < 0.001) and document number squared (p < 0.001) were also statistically significant predictors of the expected rating.

In addition, the negative values of the estimated effects of document number (β = -0.0616) and document number squared (β = -0.0008) indicate that as the document number increases, the expected rating decreases. Therefore, document 1 is expected to have the highest rating for participants in both groups, document 2 the next highest, and so on. (If only the transformed variable, document number squared, is used to represent document number in the regression model, its estimated effect is also negative, β = -0.0006.)
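As a quick arithmetic check on these coefficients, the step-by-step change in expected rating from document d to document d + 1 is β₁ + β₂(2d + 1), which is negative at every position when both coefficients are negative; a minimal sketch:

```python
# Reported coefficients for document number and document number squared.
B1, B2 = -0.0616, -0.0008

def rating_change(d: int) -> float:
    """Change in expected rating from document d to document d + 1."""
    return B1 + B2 * (2 * d + 1)

# Negative at every position, so expected ratings decline monotonically:
print(all(rating_change(d) < 0 for d in range(1, 40)))  # True
```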

Figure 4.11: Histogram depicting the average ratings for both groups (from Excel)


Figure 4.12: Interaction plot for the average rating score of each of the forty documents by group (from Minitab)

4.3.2.2 Control Group vs. Alternative Search Group

The third hypothesis tested was that there was no difference in expected rating across participants in the control group vs. the Alternative Search group. This Alternative Search group consisted of 3 participants whose final results were produced using the Alternative Search button. The regression model used for testing this hypothesis, therefore, was based on 33 participants (30 in the control group and 3 in the Alternative Search group), with 40 observations per participant. Using an F statistic, this hypothesis was rejected at p < 0.001. The average effect (coefficient) for participants in the control group was β = 0.47 (range from -0.69 to 2.29); for participants in the Alternative Search group, it was β = 0.42 (range from -0.56 to 1.49). However, due to the very small sample size, no firm conclusion could be drawn. The average rating in the Alternative Search group was 2.83 with a standard deviation of 1.03 and a range from 1.850 to 3.900.

4.3.2.3 Control Group vs. Modified Search Group

The fourth hypothesis tested was that there was no difference in expected rating across participants in the control group vs. the Modified Search group. This Modified Search group consisted of 9 participants whose final results were produced using the Modified Search button. The regression model used for testing this hypothesis, therefore, was based on 39 participants (30 in the control group and 9 in the Modified Search group), with 40 observations per participant. Using an F statistic, this hypothesis was rejected at p < 0.001. The average effect (coefficient) for participants in the control group was β = 0.65 (range from -0.50 to 2.48); for participants in the Modified Search group, it was β = 0.91 (range from 0.00 to 2.23). The average rating in the Modified Search group was 3.13 with a standard deviation of 0.66 and a range from 2.225 to 4.450.

4.3.2.4 Control Group vs. “Four Buttons” Group

Finally, the last hypothesis tested was that there was no difference in expected rating across participants in the control group vs. the “Four Buttons” group. This “Four Buttons” group consisted of 24 participants who clicked on any of the four executable buttons provided by SAOA-1 for search assistance during the search process. The regression model used for testing this hypothesis, therefore, was based on 54 participants (30 in the control group and 24 in the “Four Buttons” group), with 40 observations per participant. Using an F statistic, the results suggested that there was a significant difference in expected rating between the control group and the “Four Buttons” group; the hypothesis was rejected at p < 0.001. The average effect (coefficient) for participants in the control group was β = 1.71 (range from 0.56 to 3.53); for participants in the “Four Buttons” group, it was β = 1.96 (range from 0.58 to 3.58). This suggests that the effect of the “Four Buttons” group was greater than that of the control group. The average rating in the “Four Buttons” group was 3.12 with a standard deviation of 0.84 and a range from 1.750 to 4.475.

In summary, comparing the average ratings between the control group (M = 2.88, SD = 0.72) and the treatment group (M = 3.17, SD = 0.85), the average expected rating increased 10% due to the treatment. The results suggested, overall, that information seekers benefited slightly from search assistance such as SAOA-1 in refining their searches and retrieving more relevant documents. Note that SAOA-1 was constructed as a search agent embedding the semantics and domain-specific knowledge extracted from a relational database (one built for purposes other than supporting Internet searches) and provided search assistance through an interactive interface. The results provided evidence that such a “search agent” model was able to slightly enhance IR on the Internet within the domain.

The orders of expected ratings of the documents corresponded to the document numbers (i.e., Google's rankings). In other words, on average, top-ranked documents (i.e., lower document numbers) received higher ratings than bottom-ranked documents. The negative effect (i.e., coefficient) implied that Google's ranking algorithms were working as expected.

The Alternative Search group (M = 2.83, SD = 1.03) was found to have a lower rating (a 2% decrease) than the control group (M = 2.88, SD = 0.72). This may suggest that applying general search engine knowledge (such as the ordering of query terms and general synonyms) alone may not be enough to produce more relevant documents. However, the sample size (3 participants) was too small to draw a conclusion for this group.

The Modified Search group (M = 3.13, SD = 0.66) was found to have a slight improvement (a 9% increase) over the control group (M = 2.88, SD = 0.72). This may suggest that by applying domain-specific knowledge (on top of the search engine knowledge), the Modified Search was able to assist information seekers in producing more relevant documents.

The “Four Buttons” group was examined by comparing its effect with that of the control group. Assuming that participants who clicked on any of the four executable search buttons provided by SAOA-1 during their search processes received certain domain-specific knowledge not available to those in the control group, regardless of how the final retrieval was produced (i.e., by which search button), the results suggested that this “Four Buttons” group (M = 3.12, SD = 0.84) also had a slight improvement (an 8% increase) over the control group (M = 2.88, SD = 0.72).

4.3.3 Search Behaviors

In addition to the document relevance ratings, three additional analyses were conducted concerning search behaviors:

a. average length of initial queries submitted by all searchers

b. influence of treatment type on searcher behavior (number of searches submitted before the participant was satisfied with the results)

c. identification of search patterns among participants (how did the searchers quit?)

4.3.3.1 Average length of initial queries

The two documents (on hand injuries in the workplace) given to the study participants (Section III of the questionnaire) served to help them form individual term paper topics (search topics). A list of the search topics and initial queries for all 60 participants is shown in Table 4.10. Among the 60 initial queries, “hand injuries”, “Hand injuries”, or “hand injury” was the most frequent, submitted by 8 participants (13.3%), followed by “occupational injuries”, submitted by 3 participants (5%), and “workplace hand injuries” or “workplace hand injury”, also submitted by 3 participants (5%). The average length of initial queries in the control group was 4.40 terms with a standard deviation of 2.01 terms and a range from 2 to 9 terms. The average length of initial queries in the treatment group was 3.23 terms with a standard deviation of 1.38 terms and a range from 2 to 9 terms. The test results suggest that the average length of initial queries in the control group (M = 4.40, SD = 2.01) was statistically significantly longer than that of the treatment group (M = 3.23, SD = 1.38), t(58) = 2.62, p = 0.006. Figure 4.13 shows a histogram depicting the lengths (number of terms) of the initial queries for both groups. Figure 4.14 shows a graphic summary of the lengths of the initial queries submitted by all 60 searchers. The average length of initial queries overall was 3.8 terms with a standard deviation of 1.8 terms and a range from 2 to 9 terms. This average query length is higher than the 2 terms reported by Jansen and Pooch (2001) and the 2.2 and 2.7 terms reported by Beitzel, Jensen, Chowdhury, Frieder, and Grossman (2007); both comparisons are discussed following Table 4.10.

Participant   Topic (T) / Initial Query (Q)

001  T: Occupational accidents to hands due to machinery misuse
     Q: hand injuries
002  T: Work related injuries
     Q: work related injuries
003  T: Hand injuries occur in the workplace based on uninformed ways of using the equipment, their surroundings, and not reporting accidents, no matter how small
     Q: stop hand injuries in restaurants
004  T: Common hand injuries in the workplace
     Q: occupational avulsion
005  T: Hand injuries at work
     Q: hand injuries at work
006  T: Hand injuries in the work place
     Q: Hand injuries
007  T: What are the main causes of hand injuries in the workplace?
     Q: Hand Injuries in the work place
008  T: Hand injuries
     Q: hand injuries
009  T: Job related hand injuries
     Q: job related hand injuries
010  T: Hand injuries among different occupational levels
     Q: hand injuries occupational
011  T: Workplace accidents
     Q: workplace accidents
012  T: Hand injuries
     Q: hand injuries
013  T: Hand injuries at workplaces and how to combat the injuries in the most efficient way
     Q: how a plastic injection moulding machine works
014  T: Hand injuries
     Q: hand injuiries
015  T: Occupational accidents involving hands
     Q: Occupational Accidents involving hands
016  T: Occupational hand and wrist injuries
     Q: workplace hand injuries
017  T: Jobs that have high hand-risk for injuries and their cost to business
     Q: high hand risk occupations

Continued

Table 4.10: List of search topics and initial queries

Table 4.10 continued

018  T: Hand injuries caused by occupational accidents in work place
     Q: hand injuries in work place
019  T: Hands, and likely accidents
     Q: preventative steps ruducing hand injuries in workplace
020  T: Workplace injuries
     Q: workplace injuries
021  T: Cost of hand injuries in the work place
     Q: cost of hand injuries in the work place
022  T: Types of equipment that causes hand injuries
     Q: equipment hand injuries
023  T: The incidence of of the hand or forearm in developed countries
     Q: occupational injury hand forearm incidence
024  T: Work-related injuries
     Q: hand injury work
025  T: Machinery can cause injuries in the workplace
     Q: injuries workplace machinery
026  T: Number of hands/fingers are lost in work related injuries
     Q: work related injuries
027  T: Most of common cause of hand injuries sustained by industrial workers
     Q: Common causes of hand injuries sustained by industrial workers
028  T: Risk factors of hand injuries at work place
     Q: risk of hand injuries at work
029  T: Occupational injury - Hand injuries - the comparison of restaurant workers and non-restaurant workers
     Q: occupational hand injuries
030  T: Hand injury
     Q: hand injury
031  T: Possible occupational injuries in restaurant industries
     Q: occupational injuries
032  T: How to prevent hand injuries in kitchen or other workplaces
     Q: hand injury workplace kitchen
033  T: Impact of work injury - hand cut to the company vs. employees
     Q: work compensation
034  T: Estimating economic costs of hand injuries caused by plastic molding machine
     Q: workplace injuries hands
035  T: Hand injuries in the workplace
     Q: hand injuries work place

Continued

Table 4.10 continued

036  T: Worker's compensation from hand injuries
     Q: workers compensation hand injuries
037  T: Hand injuries caused by accidents in a workplace environment. How can improved design reduce the number of on the job injuries & will that save money?
     Q: hand injuries kitchen
038  T: Causes of hand injuries and prevention ideas
     Q: prevention of hand injuries
039  T: Cost of occupational injuries for manufacturing industry
     Q: cost occupational injury manufacturing
040  T: Hand safety takes an important role among industries & food business
     Q: hand safety industries
041  T: What are the main causes of hand injuries while on the job?
     Q: causes of hand injuries on the job
042  T: Prevalence of hand injury in manufacturing & resulting costs
     Q: Occupational Hand Injury Prevalence
043  T: How to protect your hand in your workplace
     Q: how to protect your hand in your workplace
044  T: How to prevent accidents or injuries at our workplace?
     Q: how to prevent injuries or accidents at our workplace
045  T: Hand injury in the workplace
     Q: hand injury workplace
046  T: The most common injury in workplace
     Q: occupational injuries
047  T: The rise of hand injuries in the workplace and prevention steps that are being taken to prevent future occurrences
     Q: workplace hand injury prevention
048  T: The effect of hand injuries on the U.S. economy
     Q: hand injuries effect economy
049  T: Work place hand injury
     Q: workplace hand injury
050  T: How in-work accidental rate such as injuries in IM machine operator compared to kitchen workers
     Q: injection molding injuries
051  T: Hand injuries in the workplace
     Q: workplace hand injuries
052  T: Costs involved in work place injuries
     Q: costs involved in workplace injuries

Continued

Table 4.10 continued

053  T: Hand injuries
     Q: hand injuries
054  T: Different hand injuries levels according to different workplaces
     Q: hand injuries
055  T: Occupational hand injuries
     Q: occuptational injuries
056  T: Statistics in hand injury at work
     Q: statistics hand injury work
057  T: Occupational hand injuries
     Q: occupational hand injuries
058  T: Hand injuries in industrial accidents
     Q: hand injuries in industrial accidents
059  T: Consequences of occupational injuries to the hand
     Q: occupational hand injury consequences
060  T: On the job cuts on the rise in plastic moulding machines
     Q: cuts plastic moulding machines


Figure 4.13: Histogram depicting the length of initial queries by groups (from Excel)


Figure 4.14: A graphic summary of the average length of initial queries (from Minitab)

Both comparison studies were based on search engine Web query logs with various time spans and myriad users. Jansen and Pooch (2001) reported an average query length of 2 terms in a survey of three Web-searching studies. The search engines included in those three studies were Fireball (http://www.fireball.de/), with a log covering 31 days in 1998; Excite (http://www.excite.com/), with a log covering a portion of one day in 1997; and Alta Vista (http://www.altavista.com/), with a log covering 43 days in 1998. Beitzel, Jensen, Chowdhury, Frieder, and Grossman (2007) analyzed two query logs from America Online (http://www.aol.com/). The average query length was 2.2 terms from the first log, covering 1 week in 2003, and 2.7 terms from the second log, covering 6 months (2004 to 2005).
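For reference, the two-sample comparison of initial query lengths reported earlier can be reproduced with a standard t test; a minimal sketch, assuming the per-participant term counts are available as two lists (the function and variable names are illustrative, not from the original analysis):

```python
# Hypothetical sketch of the query-length comparison (control vs. treatment).
from scipy import stats

def compare_query_lengths(control, treatment):
    """Two-sample t test of initial-query term counts.

    With the study's data (30 counts per group), this comparison was
    reported as t(58) = 2.62, p = 0.006 (one-sided).
    """
    t, p_two_sided = stats.ttest_ind(control, treatment)  # pooled variance, df = 58
    return t, p_two_sided / 2  # one-sided p; control hypothesized longer
```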

Because some of the participants' topics focused on the “costs” (i.e., compensation, cost, economy, risk, statistics) or the “prevention or protection” (i.e., cause, design, prevention, protection) of occupational injuries, an additional test was run using only participants whose topics consisted strictly of occupational hand injuries (i.e., removing the “costs” and “prevention or protection” topics mentioned above). The hypothesis was that there was no difference in expected rating between the control group and the treatment group for topics strictly focused on occupational injuries. There were 19 and 23 such topics (participants) in the control group and the treatment group, respectively, with 40 observations per participant. Using an F statistic, this hypothesis was rejected at p < 0.001. The average effect (coefficient) was β = 0.85 (range from -0.31 to 2.66) in this control subgroup and β = 1.05 (range from -0.21 to 2.71) in this treatment subgroup. The average rating in the control subgroup was 2.89 with a standard deviation of 0.76 and a range from 1.725 to 4.700. The average rating in the treatment subgroup was 3.09 with a standard deviation of 0.77 and a range from 1.825 to 4.750.

4.3.3.2 Number of searches

To test whether SAOA-1 had an influence on searchers' behaviors, the number of searches submitted by each participant was used as an indicator of search behavior. The results indicated that there was a significant difference in the number of searches submitted, t(58) = -2.26, p = 0.014, with the treatment group (M = 4.53, SD = 2.91) submitting more searches than the control group (M = 3.00, SD = 2.30). Both the treatment group and the control group had a range from 1 to 11 searches. Figure 4.15 shows a histogram depicting the number of searches submitted by group.


Figure 4.15: Histogram depicting number of searches by group (from Excel)

4.3.3.3 Search patterns

The third analysis regarding search behavior involved identifying participant search patterns. In this study we focused on search submission patterns, i.e., the sequence of methods used to submit searches during the search process. In the control group, there was only one method (the New Search), which allowed the participant to submit a search from the “Search” button at the top of the page. In other words, if a participant wanted to refine the search, there was only one way to do so. Therefore, the search patterns in the control group were either a single New Search or repeated New Searches if more than one search was performed. In the treatment group, however, there were four different ways to submit a search: New, Original, Alternative, and Modified Searches. Therefore, a search pattern in the treatment group could be any combination of the four search options. All searches began with a New Search. In the treatment group, three participants were satisfied with the results produced by the first search; for these three participants, the search pattern was simply New Search. Eight participants were satisfied with the second search. Among these eight, two also used the New Search for their second query, giving the pattern New Search – New Search. A single participant tried the Alternative Search for his second search, giving the pattern New Search – Alternative Search. The remaining five participants used the Modified Search to continue searching, giving the pattern New Search – Modified Search. Table 4.11 shows the search patterns sorted by the number of searches per participant in the control group. Because all searches in the control group were submitted using the (New) “Search” button, the search patterns there are all New Searches with varying numbers of searches. Table 4.12 shows the search patterns (search methods submitted) sorted by the number of searches per participant in the treatment group. Among the thirty participants in the treatment group, there were twenty-two different search patterns. It is apparent that as the number of searches increased, so did the variety of search patterns.
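Once each participant's sequence of search methods is encoded as a string, counting distinct patterns is straightforward; a minimal sketch (the sequences shown are an illustrative subset of Table 4.12, not the full data set):

```python
# Count distinct search patterns; N = New, O = Original,
# A = Alternative, M = Modified.
from collections import Counter

patterns = ["N", "N A", "N M", "N M", "N A N"]  # illustrative subset

distinct = Counter(patterns)
print(len(distinct))           # number of distinct patterns in the subset
print(distinct.most_common())  # most frequent patterns first
```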

Table 4.13 reorganizes the search patterns in the treatment group into four different types (sorted by the number of searches per participant). Type A contains only New Searches. Type B involves search patterns ending with the Alternative Search, and Type C designates patterns ending with the Modified Search. Type D indicates search patterns containing the Original, Alternative, or Modified Search but excluding those ending with the Alternative Search (Type B) or the Modified Search (Type C).

No   Participant   Search Pattern            Number of Searches
1    020           N                         1
2    028           N                         1
3    032           N                         1
4    035           N                         1
5    044           N                         1
6    046           N                         1
7    047           N                         1
8    049           N                         1
9    058           N                         1
10   006           N N                       2
11   019           N N                       2
12   029           N N                       2
13   048           N N                       2
14   052           N N                       2
15   053           N N                       2
16   003           N N N                     3
17   009           N N N                     3
18   021           N N N                     3
19   034           N N N                     3
20   041           N N N                     3
21   043           N N N                     3
22   050           N N N                     3
23   004           N N N N                   4
24   042           N N N N                   4
25   002           N N N N N                 5
26   005           N N N N N                 5
27   030           N N N N N                 5
28   007           N N N N N N               6
29   023           N N N N N N N N           8
30   013           N N N N N N N N N N N     11

Table 4.11: Search patterns in the control group (sorted by the number of searches per participant) Note: N = New Search

No   Participant   Search Pattern            Number of Searches
1    018           N                         1
2    051           N                         1
3    056           N                         1
4    039           N N                       2
5    040           N N                       2
6    057           N A                       2
7    012           N M                       2
8    014           N M                       2
9    017           N M                       2
10   027           N M                       2
11   036           N M                       2
12   045           N A N                     3
13   059           N A N                     3
14   011           N M N                     3
15   022           N N N N                   4
16   038           N N M N                   4
17   031           N M N N N                 5
18   016           N M O A O                 5
19   025           N M M O M                 5
20   033           N N A O N A               6
21   060           N A N M N A               6
22   024           N M N M N M               6
23   010           N M M M M M               6
24   026           N M N A O M O             7
25   037           N M A N M M N             7
26   015           N M M N A N N             7
27   008           N A M N M N N M O         9
28   001           N A M M N N M M A M       10
29   054           N M M M M N M N M N       10
30   055           N M N N N M M M M M O     11

Table 4.12: Search patterns in the treatment group (sorted by the number of searches per participant) Note: N = New Search, O = Original Search, A = Alternative Search, M = Modified Search

Type A: Using only the New Search
No   Participant   Search Pattern            Number of Searches
1    018           N                         1
2    051           N                         1
3    056           N                         1
4    039           N N                       2
5    040           N N                       2
6    022           N N N N                   4

Type B: Ending with the Alternative Search
1    057           N A                       2
2    033           N N A O N A               6
3    060           N A N M N A               6

Type C: Ending with the Modified Search
1    012           N M                       2
2    014           N M                       2
3    017           N M                       2
4    027           N M                       2
5    036           N M                       2
6    025           N M M O M                 5
7    010           N M M M M M               6
8    024           N M N M N M               6
9    001           N A M M N N M M A M       10

Type D: Containing the Original Search, the Alternative Search, or the Modified Search, excluding patterns ending with the Alternative Search (Type B) or the Modified Search (Type C)
1    011           N M N                     3
2    045           N A N                     3
3    059           N A N                     3
4    038           N N M N                   4
5    016           N M O A O                 5
6    031           N M N N N                 5
7    015           N M M N A N N             7
8    026           N M N A O M O             7
9    037           N M A N M M N             7
10   008           N A M N M N N M O         9
11   054           N M M M M N M N M N       10
12   055           N M N N N M M M M M O     11

Table 4.13: Four different types of search patterns in the treatment group (sorted by the number of searches per participant) Note: N = New Search, O = Original Search, A = Alternative Search, M = Modified Search

For readability, the screen images presented later in this section are all partially enlarged and as a result display only the first seven documents retrieved from each search.

The “Run Original Search,” “Run Alternative Search,” “Show Modified Search,” and “Run Modified Search” buttons are the four executable buttons provided by SAOA-1 to assist with the search. Type A refers to the group of participants who did not click on any of the four executable buttons even though they were exposed to the treatment (SAOA-1) with its domain-specific knowledge. Six participants in the treatment group did not click on any of the four executable buttons during their searches. Three of them were satisfied with the results of their first search. Two of them submitted searches using the New Search twice. One participant, participant 022, repeated the New Search four times, so participant 022's search pattern was New Search – New Search – New Search – New Search. Figures 4.16 to 4.19 show screen images of participant 022's results from each of the four searches. Participant 022's topic was “types of equipment that cause hand injuries,” and the four search queries were (1) “equipment hand injuries,” (2) “types equipment hand injuries,” (3) “types equipment hand injuries occupational,” and (4) “equipment most likely hand injuries.” During the search process, participant 022 did not explore any of the information provided in the left section of the Results Page. Had the participant explored the suggested concepts in the Source of Injury class, the participant might have discovered that some of the equipment listed under Source of Injury was of interest, or found clues to other types of equipment relevant to the search.


Figure 4.16: A screen image of participant 022’s first search outcome (partially enlarged for readability)


Figure 4.17: A screen image of participant 022’s second search outcome (partially enlarged for readability)


Figure 4.18: A screen image of participant 022’s third search outcome (partially enlarged for readability)


Figure 4.19: A screen image of participant 022’s fourth (final) search outcome (partially enlarged for readability)

Type B involves search patterns ending with the Alternative Search. Three participants produced three different search patterns (see Table 4.13). Participant 057 stopped at the second search attempt with the pattern New Search – Alternative Search. Figures 4.20 and 4.21 show the screen images of participant 057's resulting document pages. Participant 057's topic was “occupational hand injuries,” which was also used as the original search query.

Type C encompasses search patterns ending with the Modified Search. Nine participants produced five different search patterns (see Table 4.13). Five participants, including participant 012, stopped at the second search and shared the same pattern, New Search – Modified Search. Figures 4.22 and 4.23 show the screen images of participant 012's two search results. His topic was “hand injuries,” which was also used as the search query. In addition to the original search query “hand injuries,” participant 012 also selected the concepts “Bruise/Contusion/Abrasion,” “Burn (Chemical),” “Burn/Scald (Heat),” “Cut/Laceration,” and “Punctures” from the Nature of Injury class and the concepts “Fire/Smoke,” “Machine,” “Materials Handling,” and “Working Surface” from the Source of Injury class to form a Modified Search, and was satisfied with the Modified Search results.

Type D encompasses the search patterns of participants who clicked at least one of the four executable buttons during their search process even though the final results were not produced by the Alternative Search (Type B) or the Modified Search (Type C). Combining Types B, C, and D, we conjecture that this cumulative group obtained domain-specific knowledge from SAOA-1 to assist with the search, regardless of whether the final results were produced using the Alternative Search or the Modified Search.


Figure 4.20: A screen image of participant 057’s first search outcome (partially enlarged for readability)


Figure 4.21: A screen image of participant 057’s second (final) search outcome (partially enlarged for readability)


Figure 4.22: A screen image of participant 012’s first search outcome (partially enlarged for readability)


Figure 4.23: A screen image of participant 012’s second (final) search outcome (partially enlarged for readability)

In Type D alone, there are eleven different search patterns produced by twelve participants (see Table 4.13). Participants 045 and 059 are the only two who shared the same search pattern, New Search – Alternative Search – New Search. Participant 015, with the search pattern New Search – Modified Search – Modified Search – New Search – Alternative Search – New Search – New Search, is an example of how SAOA-1 can lead a participant to the final outcome even though it was not produced by means of the Alternative Search or the Modified Search.

Figures 4.24 to 4.30 show screen images of the search outcomes produced by each of participant 015's seven searches. Participant 015's topic was “occupational accidents involving hands,” which was also used as the first search query (Figure 4.24). For the second search, the participant explored the “Suggested Refinements for the Alternative Search” and selected concepts to form a Modified Search: from the Nature of Injury class, “Bruise/Contusion/Abrasion,” “Burn/Scald (Heat),” and “Cut/Laceration”; from the Source of Injury class, “Fire/Smoke,” “Machine,” “Materials Handling,” and “Working Surface” (Figure 4.25). For the third search, the participant modified the concepts in the Source of Injury class by adding “Hand Tool (Manual)” and “Hand Tool (Powered)” and removing “Materials Handling” from the previous query (Figure 4.26). The participant then started a New Search using the query “Hand injuries in the work place” as the fourth search (Figure 4.27), followed by submitting the Alternative Search (Figure 4.28). The participant then started another New Search using the query “hand injuries in the work place” (the participant could instead have retrieved these outcomes by clicking the “Run Original Search” button) (Figure 4.29).


Figure 4.24: A screen image of participant 015’s first search outcome (partially enlarged for readability)


Figure 4.25: A screen image of participant 015’s second search outcome (partially enlarged for readability)


Figure 4.26: A screen image of participant 015’s third search outcome (partially enlarged for readability)


Figure 4.27: A screen image of participant 015’s fourth search outcome (partially enlarged for readability)


Figure 4.28: A screen image of participant 015’s fifth search outcome (partially enlarged for readability)


Figure 4.29: A screen image of participant 015’s sixth search outcome (partially enlarged for readability)


Figure 4.30: A screen image of participant 015’s seventh (final) search outcome (partially enlarged for readability)

From this set of document links, the participant clicked on and read the second document, went back to the search results, scrolled down the page, and scanned the forty documents. The participant then started a New Search using a brand-new query, “repetitive stress injuries,” and then stopped searching altogether (Figure 4.30).

The second document within the results of the sixth search was the first and only document investigated by the participant. This document reported legal cases about workplace disabilities involving the hand injuries known as “repetitive stress injuries” (RSI) (Lorch, 1992). Including that second document, a total of ten outcomes (25%) of the forty-document set were related to the topic of “repetitive stress injuries.” Table 4.14 displays the descriptions related to “repetitive stress injury” and the Web addresses of those ten results.

Note that participant 015's topic was “occupational accidents involving hands,” which was also the initial query, while the final query was “repetitive stress injuries.” The term “repetitive stress injury” or “repetitive strain injury” is formally known as “work-related upper limb disorder” (WRULD) (Larkin, 2000; van Eijsden-Besseling, Peeters, Reijnen, & de Bie, 2004). Clearly, “repetitive stress injury” was not the participant's initial search focus. The participant's query had changed from a general concept of work-related accidents to the hand (“occupational accidents involving hands”) to a specific type of work-related hand injury (“repetitive stress injury”). The participant might have learned additional knowledge (“repetitive stress injury”) regarding work-related hand injuries from SAOA-1 during the search process, leading to a final search based on a modification of the original topic. However, there is no direct evidence to support this claim.

Document   Description related to repetitive stress injury /
Number     Document Web address
2          (None, but this was the article read by the participant)
           query.nytimes.com/gst/fullpage.html?res=9E0CE2DA173AF930A35755C0A964958260
3          (Different URL but the same article as Document Number 2)
           query.nytimes.com/gst/fullpage.html?res=9E0CE2DA173AF930A35755C0A964958260&sec=&spon=
4          repetitive motion injuries
           www.gag.org/resources/rmi.php
7          www.aiha.org/Content/AccessInfo/consumer/AnErgonomicsApproachtoAvoidingWorkplaceInjury.htm
9          hand forces and wrist motions for preventing injuries ...
           ieeexplore.ieee.org/iel2/128/2378/00066325.pdf
10         repetitive stress injury (RSI)
           findarticles.com/p/articles/mi_m0675/is_1_19/ai_69651754
15         repetitive ... hand and wrist
           www.feldenkrais.com/method/article/a_systemic_approach_to_workplace_injury/
34         repetitive exertions of the hand
           www.convertingmagazine.com/article/CA601203.html
35         work-related repetitive hand ..... repetitive motion injury
           www.nap.edu/openbook.php?record_id=10032&page=439
36         ...Injuries (RSIs) of the wrist
           www.springerlink.com/index/w351170354005614.pdf

Table 4.14: Ten search outcomes related to repetitive stress injury from the first forty results of participant 015’s sixth search

4.3.4 Usability of the Interface

Likert-scale (1-to-5) and open-ended questions were used to evaluate the usability of the interface.

Questions 1 to 4 for the treatment group were Likert questions asking how much the participant agreed or disagreed with each of the following statements:

1. It was easy to use this search tool.

2. This search tool helped me to better refine my searches.

3. This search tool helped me to better understand what I was retrieving.

4. I was successful in finding the information I need to write my term paper.

Each participant indicated his or her agreement with each statement on the Likert scale (1 = strongly disagree, 5 = strongly agree). Table 4.15 shows a summary of the descriptive statistics for each of the four question ratings in the treatment group. All four questions had an average rating of at least 4.2, with individual ratings ranging from 2 to 5 on the Likert scale. Figure 4.31 depicts the treatment group's ratings of questions 1 to 4 as histograms.

Variable   Total Count   Mean    SE Mean   StDev   Variance   Minimum   Maximum   Range
Q1         30            4.433   0.141     0.774   0.599      2.000     5.000     3.000
Q2         30            4.200   0.147     0.805   0.648      2.000     5.000     3.000
Q3         30            4.233   0.164     0.898   0.806      2.000     5.000     3.000
Q4         30            4.633   0.102     0.556   0.309      3.000     5.000     2.000

Table 4.15: Descriptive statistics for Questions 1 to 4 in the treatment group (from Minitab)

Questions 1 and 2 in the control group posed the same questions as questions 2 and 4 in the treatment group. Table 4.16 shows a summary of the descriptive statistics for the ratings of these two questions in the control group. Question 1 had an average rating of 3.633 with a range from 1 to 5 on the Likert scale; question 2 had an average rating of 4.633 with a range from 3 to 5. Figure 4.32 depicts the control group's ratings of questions 1 and 2 as histograms.


Figure 4.31: Histograms depicting Likert Scale rating of Questions 1 to 4 in the treatment group (from Minitab)

Variable   Total Count   Mean    SE Mean   StDev   Variance   Minimum   Maximum   Range
Q1         30            3.633   0.217     1.189   1.413      1.000     5.000     4.000
Q2         30            4.633   0.112     0.615   0.378      3.000     5.000     2.000

Table 4.16: Descriptive statistics for Questions 1 and 2 in the control group (from Minitab)

Comparing the rating scores between question 2 in the treatment group and question 1 in the control group, the expected rating in the treatment group (M = 4.20, SD = 0.81) was significantly greater than the expected rating in the control group (M = 3.63, SD = 1.19), t(58) = -2.16, p = 0.035. This constitutes evidence that participants perceived SAOA-1 as helping them better refine their searches.

Figure 4.32: Histograms depicting Likert Scale rating of Questions 1 and 2 in the control group (from Minitab)

Comparing the ratings of question 4 (treatment group) and question 2 (control group), there was no significant difference between the expected rating of the treatment group (M = 4.63, SD = 0.56) and the control group (M = 4.63, SD = 0.62), t(58) = -0.00, p = 1.000. In other words, participants in both groups perceived themselves as successful in finding the information they needed to write their term paper.

Questions 5 to 8 in the treatment group asked participants to respond to the following open-ended prompts regarding the search tool:

5. List the most negative aspect(s).

6. List the most positive aspect(s).

7. What aspects of this search tool are especially important?

8. Additional comments.

Question 3 in the control group solicited additional comments from each participant, if any. All the answers to these open-ended prompts are listed in Appendix D. A brief summary of the responses is given below.

Regarding question 5, the most negative aspect(s) of SAOA-1, four participants did not supply any negative feedback. One participant wrote that the interface was too simple. Four participants stated that they were unfamiliar with the interface because of the screen format (search query details and key information were located on the left side of the screen). Of these four, one participant indicated that the search tool was different from what he had used in previous Internet searches; another thought the side bar was confusing (she later explained that she uses Google, which does not have a side bar); and two described themselves as not conditioned to look for details on the left side of the screen. Five participants commented on the limitations of SAOA-1's search functions, such as the inability to use advanced search operators (“AND”, “OR”, and quotation marks) that are available in commercial search engines.

Regarding question 6, the most positive aspect(s) of SAOA-1, twenty-three participants (77%) commented positively on the assistance the search tool provided in refining their queries, and 27% remarked on how easy it was to use. One of these participants noted, “The suggested terms gave me new ideas of other related searches I wouldn't otherwise have thought of.” Another indicated that the search tool saved her time by picking keywords. Two participants stated that there were no ads or other distracting things on the page. Three participants commented that the interface was simple, clean, and easy to read.

Regarding question 7, addressing the especially important aspects of SAOA-1, sixteen participants (53%) commented that the assistance with search refinement was especially important. Among these sixteen, six participants (20%) specifically indicated that the relevance of the Alternative Search function was especially important. One participant stated that SAOA-1 provided a quick history of the queries he had tried. Another said, “(The agent) teaches users that different search terms bring different results (for better or worse). Many casual search engine users don't realize this.” Three participants reported that ease of use was important to them. Five participants stated that the relevance or accuracy of results, or reliable sources, were especially important.


CHAPTER 5

DISCUSSION AND CONCLUSIONS

5.1 Introduction

In this study, we developed and evaluated a prototype of a topic-oriented search agent, SAOA-1, to assist IR on the Internet. SAOA-1 was developed with embedded semantics and domain-specific knowledge extracted from OSHA's Accident Investigation Search database, which was not intended to support Internet search. This relational database contains information about semantics relevant to the domain of “Occupational Accidents.” Conceptual classes, concepts, and semantic relationships among concepts were extracted from the database and embedded in SAOA-1 to assist searching. When activated, SAOA-1 assists information seekers by suggesting appropriate refinements and providing user interactions that enable manipulation of the suggested refinements.
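To illustrate this extraction step, the sketch below pulls concept classes and their concepts out of a relational table; the schema (table and column names) is a hypothetical stand-in, not OSHA's actual schema:

```python
# Hypothetical sketch: extracting conceptual classes and their concepts
# from a relational database to build a search agent's refinement lists.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect("accidents.db")  # hypothetical local copy

# Assumed schema: accidents(nature_of_injury, source_of_injury, part_of_body, ...)
classes = ["nature_of_injury", "source_of_injury", "part_of_body"]
concepts = defaultdict(set)
for col in classes:
    for (value,) in conn.execute(f"SELECT DISTINCT {col} FROM accidents"):
        if value:
            concepts[col].add(value)

# concepts["source_of_injury"] would then hold terms such as "Machine" or
# "Hand Tool (Powered)" that an agent like SAOA-1 can suggest as refinements.
```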

The rest of this chapter is organized as follows. Section 5.2 provides a summary of research findings. Section 5.3 addresses the significance of this study. Section 5.4 discusses the limitations of this study. Finally, Section 5.5 suggests possible directions for extending this research.

5.2 Summary of Research Findings

As discussed earlier, we developed a functional prototype, SAOA-1, that demonstrated the following design concepts:

a. Activates upon keyword recognition (related to occupational accidents)

b. Constructs an “Alternative Search” using search engine knowledge

c. Applies semantics and knowledge extracted from a relational database

d. Identifies conceptual relationships and focuses attention on related concepts

e. Suggests refinements for the “Alternative Search” to users

Note that, although the focus was on semantic representations inferred from OSHA's Accident Investigation Search database, the approach is domain independent. A similar search assistance agent could be developed for any domain that has an established relational database with associated indexing terms that have been used to index documents in that domain. Example domains include health (MedlinePlus, http://www.nlm.nih.gov/medlineplus/medlineplus.html, provided by the NLM and the National Institutes of Health), education resources (e.g., the ERIC Thesaurus), and movies (e.g., IMDB). For example, the IMDB main page for each movie title contains indexing terms such as Director, Writers, Genre, Plot Keywords, Cast, Sound Mix, and Certification, together with their associated terms (i.e., concepts).

In addition to developing this prototype system, an empirical study was conducted to evaluate its impact on search performance.

5.2.1 Related Studies

Greenberg (2001) examined “whether QE via semantically encoded thesauri techniques is more effective in the automatic or interactive processing environment”. She found that “synonyms and narrow terms are generally good candidates for automatic QE” and that “related terms are better candidates for interactive QE”. In SAOA-1, the initial query was automatically expanded into an Alternative Search, incorporating synonym searches using search engine knowledge. In addition, we provided an un-pruned list of terms for broadening the search and a pruned list of terms, based on the existence of semantic relationships (as inferred from the OSHA database), for narrowing the search via the Modified Search. Based in part on guidance from Greenberg's study, these broader and narrower terms were presented to the user to support interactive query expansion.

Imai, Collier, & Tsujii (1999) suggested reformulating queries using thesaurus-based expansion and a local feedback method in sequence. In SAOA-1, the lists of suggested refinements were generated dynamically based on the user's query. In addition, users were able to select additional suggested terms and re-run the query to generate a new set of results, as in the study of Foo, Hui, Lim, and Hui (2000). The general nature of interaction with SAOA-1 is thus consistent with these studies as well, permitting the user to make judgments about whether and when to expand or change the query.
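To make the automatic expansion concrete, the sketch below builds an Alternative Search-style query by OR-grouping synonyms for recognized terms (using Google's OR operator); the synonym table is a hypothetical stand-in for SAOA-1's combined search engine and domain knowledge:

```python
# Hypothetical sketch of building an "Alternative Search" query by
# OR-grouping synonyms for recognized terms.
SYNONYMS = {  # illustrative stand-in for the agent's knowledge
    "injuries": ["injury", "accidents"],
    "workplace": ["occupational", "work-related"],
}

def alternative_query(query: str) -> str:
    parts = []
    for term in query.split():
        alts = SYNONYMS.get(term.lower())
        if alts:
            parts.append("(" + " OR ".join([term] + alts) + ")")
        else:
            parts.append(term)
    return " ".join(parts)

print(alternative_query("hand injuries workplace"))
# hand (injuries OR injury OR accidents) (workplace OR occupational OR work-related)
```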

5.2.2 Main Effects

The key question of this empirical study was whether the functions embedded in the prototype system, SAOA-1, improved search results relative to plain Google keyword searches. Based on the test results, the treatment effect of SAOA-1 (M = 3.17, SD = 0.85, p < 0.001) on the expected rating was statistically significant and greater than that of the control group (M = 2.88, SD = 0.72) (see Figure 4.11). From a practical perspective, however, the size of the improvement was modest (10%).

The document number (p < 0.001) and the document number squared (p < 0.001) were also statistically significant, as expected. The negative value of their combined estimated effect on the expected rating indicated that as the document number increases, the expected rating decreases. In other words, the top-ranked document was expected to be more relevant than lower-ranked documents.

For the group of participants whose final results were produced using the Alternative Search, the sample size (3 participants) was too small to draw a conclusion.

For the group of participants whose final results were produced using the Modified Search, the expected rating was statistically significantly different from that of the control group (p < 0.001). On average, the effect of this group (M = 3.13, SD = 0.66) was greater than the effect of the control group (M = 2.88, SD = 0.72).

Finally, for the group of participants who clicked on any of the four executable buttons provided by SAOA-1 for search assistance during the search process, the expected rating was statistically significantly different from that of the control group (p = 0.0186). On average, the effect of this group (M = 3.12, SD = 0.84) was greater than the effect of the control group (M = 2.88, SD = 0.72).

5.2.3 Search Behaviors

The average length of initial queries, the number of searches submitted, and search patterns were analyzed to study search behaviors.

The average length of the initial queries submitted by all sixty participants was 3.8 terms, with a range from 2 to 9 terms. Comparing the lengths of initial queries between the control group and the treatment group, the test results suggested that the initial queries in the control group (M = 4.40, SD = 2.01) were statistically significantly longer than those in the treatment group (M = 3.23, SD = 1.38), t(58) = 2.62, p = 0.006. It is not clear whether the lengths of the initial queries were influenced by the treatment, SAOA-1. Because participants in the treatment group had been given a demonstration of SAOA-1, they might have expected to receive search assistance from it and, as a result, submitted shorter queries. However, there was no direct evidence to support this argument.

The number of searches submitted by each searcher was used as an indicator of search behavior. In the treatment group, participants submitted an average of 4.5 searches (M = 4.53, SD = 2.91), compared to an average of 3.0 searches in the control group (M = 3.00, SD = 2.30). The test results indicate a statistically significant difference between the two groups, t(58) = -2.26, p = 0.014. In other words, SAOA-1, the treatment, had an influence on the number of searches submitted by the participants.

The search pattern analysis focused on how the searches were submitted, that is, the pattern of methods used to submit searches during the search process. In the control group, only the New Search method was available, so the patterns were either a single New Search or repeated New Searches, ranging from one to eleven searches per person (Table 4.11). In the treatment group, the patterns were formed from combinations of the New Search, Original Search, Alternative Search, and Modified Search. The thirty participants in the treatment group produced twenty-two different search patterns, again ranging from one to eleven searches per person (Table 4.12). It is apparent that as the number of searches increased, so did the variety of search patterns.

5.2.4 Usability of the Interface

All four Likert-scale questions (1 = strongly disagree; 5 = strongly agree) in the treatment group received an average rating of at least 4.2, with individual ratings ranging from 2 to 5 (Table 4.15, Figure 4.31). The usability ratings indicated that, on average, participants agreed that SAOA-1 was easy to use and helped them to better understand what they were retrieving and to better refine their searches.

There was evidence that participants in the treatment group (M = 4.20, SD = 0.81) perceived the agent as helping them to refine their searches better than did those in the control group (M = 3.63, SD = 1.19), t(58) = -2.16, p = 0.035.

Participants in both the treatment group (M = 4.63, SD = 0.56) and the control group (M = 4.63, SD = 0.62) on average perceived themselves as equally successful in finding the information they needed to write their term paper, regardless of test group, t(58) = -0.00, p = 1.000.

5.3 Significance of the Study

As discussed earlier, there have been a limited number of studies on methods for developing and using semantic representations to support query expansion for searching the Web (Araujo & Perez-Aguera, 2006; Chen, Ho, & Yang, 2006; Foo, Hui, Lim, & Hui, 2000; Greenberg, 2001; Imai, Collier, & Tsujii, 1999; Mandala, Tokunaga, & Tanaka, 2000; Riekert, 2002; Tudhope, Binding, Blocks, & Cunliffe, 2006; Varelas, Voutsakis, Raftopoulou, Petrakis, & Milios, 2005). These previous studies found modest improvements in recall (17% to 20%) or precision (13% to 30%). They further indicated that automatic query expansion is best suited to increasing recall by including the children of a concept in a thesaurus, while user-controlled expansion is best for refining the query to focus on a related topic.

The work reported here extends this research in three ways. First, it demonstrates how important semantic relationships that already exist in Web-accessible databases (specifically, tabular information) can be automatically processed to support semantically based query expansion. Second, it demonstrates how search engine-specific knowledge can be used to provide a context for focusing the application of query expansion techniques. Third, it shows how context-sensitive pruning can be accomplished using this semantic representation in order to help the user focus attention on relevant ways to expand a query. All three of these techniques were demonstrated in a prototype system (SAOA-1).
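To illustrate the third point, the sketch below prunes a list of candidate refinements to those that are semantically related to concepts already recognized in the query; the relationship table is a hypothetical stand-in for what would be inferred from the database:

```python
# Hypothetical sketch of context-sensitive pruning: keep only candidate
# concepts related to concepts already present in the query.
RELATED = {  # illustrative stand-in for database-derived relationships
    "Cut/Laceration": {"Machine", "Hand Tool (Powered)", "Hand Tool (Manual)"},
    "Burn/Scald (Heat)": {"Fire/Smoke", "Working Surface"},
}

def prune(query_concepts, candidates):
    """Return only the candidates related to at least one query concept."""
    neighborhood = set()
    for concept in query_concepts:
        neighborhood |= RELATED.get(concept, set())
    return [c for c in candidates if c in neighborhood]

print(prune(["Cut/Laceration"], ["Machine", "Fire/Smoke", "Working Surface"]))
# ['Machine']
```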

The results of the empirical study showed modest improvements (10%) in the quality of search results when SAOA-1 was used to retrieve information on the Web.

While information seekers found this user-controlled approach to query expansion usable, further studies are needed to understand how and when its particular features can enhance performance. Of particular interest is whether this approach is useful in helping an information seeker refine his or her topic, rather than simply improving recall or precision on a fixed topic.

5.4 Limitations of the Study

The limitations of this study are discussed below.

1. SAOA-1, acting as a search agent, used Google to retrieve documents from the

Internet and adopted Google’s rankings on documents for retrieval. Thus, the order of documents retrieved by SAOA-1, and the relevance of these documents, was influenced by Google’s relevance ranking algorithms.

2. SAOA-1 is a functional prototype of a search agent. As such, it is not yet equipped with additional search operators (e.g., "AND" or "OR") or with additional algorithms to handle stop words, ranking, and so on.

3. Information on the Internet exists in many different formats. Currently, SAOA-1 focuses on retrieving documents only. In addition, SAOA-1 is integrated with the Internet Explorer browser on the PC and has not been tested on other systems.

4. One kind of occupational-accident knowledge provided by SAOA-1 for broadening or narrowing a search is the set of concepts under the category Part of Body. However, participants did not broaden or narrow their searches using this category as might have been expected. This may be because the two articles provided addressed hand injuries, rather than injuries to other body parts, so the participants' search topics were limited to hand injuries. Participants therefore focused on the Part of Body concept "hand" instead of including other Part of Body concepts such as fingers.

5. Conceptually, a relational database with rich content, multiple levels of hierarchy, and easily extractable data would be an ideal candidate for supporting a search agent like SAOA-1. The semantics and knowledge extracted from OSHA, however, were flat lists without additional levels of hierarchy. Therefore, the generalization of SAOA-1 applies only to data structures of similarly limited richness; the contrast is sketched below.
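A small illustration of the difference, with hypothetical terms and levels (the OSHA extraction supplied only structures like the first one):

    # What SAOA-1 had to work with: flat, single-level term lists
    # (hypothetical excerpt of the Part of Body category).
    flat_part_of_body = ["hand", "finger", "wrist", "eye", "back"]

    # What a richer source could supply: nested levels that would let an
    # agent broaden to "upper extremity" or narrow to "thumb" in one step.
    hierarchical_part_of_body = {
        "upper extremity": {
            "hand": ["finger", "thumb", "palm"],
            "wrist": [],
        },
        "head": {"eye": [], "ear": []},
    }

    def children(hierarchy, concept):
        """Return the direct sub-concepts of `concept`, if any --
        the operation a flat list cannot support."""
        for parent, kids in hierarchy.items():
            if parent == concept:
                return list(kids)
            if isinstance(kids, dict) and concept in kids:
                return list(kids[concept])
        return []

    print(children(hierarchical_part_of_body, "hand"))  # ['finger', 'thumb', 'palm']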

5.5 Future Research

SAOA-1 is a prototype of a topic-oriented search agent. It takes advantage of semantics and knowledge extracted from a relational database that was built for purposes other than assisting IR on the Internet. Possible directions for extending this research are as follows.

1. SAOA-1 is not an all-purpose search engine; it serves as an add-on that further enhances one. By identifying suitable databases, more topic-oriented search agents could be built to assist Internet searches by extracting the semantics and domain-specific knowledge from those databases. Different agents could assist searches on different topics.

2. SAOA-1 incorporates Google's search knowledge (i.e., advanced operators and features) and uses Google to retrieve documents from the Internet. Technically, such a search agent could retrieve documents from more than one search engine (with each engine's knowledge incorporated), evaluate the retrievals, and then present them.

3. In the empirical study, participants were asked to decide on a search topic before performing the search tasks. A future study could be designed to allow participants to change their search topics while performing the tasks.

4. Although the semantics and knowledge embedded in SAOA-1 were built as flat lists, the model could be expanded to construct search agents with concepts at different levels and complexities of hierarchy.

5. The average length of the initial queries in the control group was significantly greater than that in the treatment group. Further studies could investigate whether this behavior was an effect of the treatment.

6. The improvement due to the treatment was only modest. Further studies could identify which feature(s) contributed most to the improvement, or what could be done to improve it further. Additional features to investigate include the decision to prune the suggestion lists and the advanced operators included in the canned query.


LIST OF REFERENCES

About ERIC - Overview – ERIC . (2008). Retrieved June 30, 2008, from http://www.eric.ed.gov/ERICWebPortal/resources/html/about/about_eric.html

Accident: 14314603 - 79-Year-Old Employee Killed, 9 Injured In Gasoline Explosion . (1987). Retrieved March 11, 2008, from http://www.osha.gov/pls/imis/accidentsearch.accident_detail?id=14314603

Advanced Operators . (2008). Retrieved March 26, 2008, from http://www.google.com/help/operators.html

Advanced Search Made Easy . (2008). Retrieved March 11, 2008, from http://www.google.com/support/bin/static.py?page=searchguides.html&ctx=advanced&h l=en

Aiyer, P., Vij, V., & Rana, R. (1995). An unusual injury of the hand in a plastic injection moulding machine. Indian Journal of Plastic Surgery, 28(1), pp. 31-32.

antonymy, WordNet. (2008). Retrieved July 14, 2008, from http://wordnet.princeton.edu/perl/webwn?s=antonymy

Araujo, L. & Perez-Aguera, J.R. (2006). Enriching thesauri with hierarchical relationships by pattern matching in dictionaries. Lecture Notes in Computer Science, 4139, pp. 268-270.

Ask Search Technology . (2008). Retrieved May 8, 2008, from http://help.ask.com/en/docs/about/ask_technology.shtml

Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval . New York: ACM Press.

Beitzel, S.M., Jensen, E.C., Chowdhury, A., Frieder, O., & Grossman, D. (2007). Temporal analysis of a very large topically categorized Web query log. Journal of the American Society for Information Science & Technology , 58(2), pp. 166-178.

Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), pp. 34-43.

Berry, M.W. & Browne, M. (2005). Understanding Search Engines: Mathematical Modeling and Text Retrieval (2nd ed.). SIAM: Society for Industrial and Applied Mathematics.

Brin, S. & Page, L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(107), pp. 107-117.

Broder, A. & Henzinger, M. (2002). Algorithmic aspects of information retrieval on the Web. In J. Abello, M. Pardalos, & M.G.C. Resende (Eds.). Handbook of massive data sets (pp. 3-23). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Buckley, C. & Voorhees, E.M. (2005). Retrieval system evaluation. In E.M. Voorhees & D.K. Harman (Eds.). TREC: Experiment and Evaluation in Information Retrieval (pp. 53-75). Cambridge, MA: The MIT Press.

Carpineto, C., Romano, G., & Giannini, V. (2002). Improving Retrieval Feedback with Multiple Term-Ranking Function Combination, ACM Transactions on Information Systems , 20(3), pp. 259-290.

Chen, I.-X., Ho, J.-C., & Yang, C.-Z. (2006). On hierarchical web catalog integration with conceptual relationships in thesaurus. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 635-636.

Chen, L., Zeng, J., & Tokuda, N.A. (2006). A “stereo” document representation for textual information retrieval, Journal of the American Society for Information Science and Technology , 57(6), pp. 768-774.

Cheung, K.-H., Yip, K.Y., Smith, A., deKnikker, R., Masiar, A., & Gerstein, M. (2005). YeastHub: a semantic web use case for integrating data in the life sciences domain. Bioinformatics, 21(Suppl. 1), pp. i85–i96.

Chiang, R.H.L., Chua, C.E.H., & Storey, V.C. (2001). A smart web query method for semantic retrieval of web data, Data and Knowledge Engineering , 38(1), pp. 63-84.

Chowdhury, G.G. (2004). Introduction to modern information retrieval (2nd ed.). London: Facet Publishing.

Cleverdon, C.W. (1991). The significance of the Cranfield tests on index languages. In Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval , pp. 3-12.

Codd, E.F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), pp. 377-387.

Cooper, W.S., Gey, F.C., & Dabney, D.P. (1992). Probabilistic retrieval based on staged logistic regression. In SIGIR '92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval , pp. 198–210.

Crestani, F., Lalmas, M., van Rijsbergen, C.J., & Campbell, I. (1998). “Is this document relevant?…probably”: a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR) , 30(4), pp. 528-552.

Croft, W.B. & Turtle, H.R. (1992). Text retrieval and inference. In P.S. Jacobs (Ed.). Text-based intelligent systems: current research and practice in information extraction and retrieval (pp. 130-133). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Daisy Publishing Company Inc. Dba Hi-Torque Public. (2001). Retrieved March 11, 2008, from http://www.osha.gov/pls/imis/accidentsearch.accident_inspection?line_item=1&id=72663

De Valk, J. (2007, July). Google websearch parameters . Retrieved March 26, 2008, from http://www.joostdevalk.nl/wp-content/uploads/2007/07/google-url-parameters.pdf

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by latent semantic analysis, Journal of the American Society for Information Science , 41(6), pp. 391-407

Definzio Imports, Inc. (1987). Retrieved March 11, 2008, from http://www.osha.gov/pls/imis/accidentsearch.accident_inspection?line_item=7&id=23834

Ding, C.H.Q. (2005). A probabilistic model for latent semantic indexing, Journal of the American Society for Information Science and Technology , 56(6), pp. 597-608.

Ding, L., Finin, T., Joshi, A., Pan, R., Cost, R.S., Peng, Y., Reddivari, P., Doshi, V., & Sachs, J. (2004). Swoogle: a search and metadata engine for the semantic web. In CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management , pp. 652-659.

Efron, M. (2005). Eigenvalue-based model selection during latent semantic indexing. Journal of the American Society for Information Science, 56(9), pp. 969-988.

entailment, WordNet. (2008). Retrieved July 14, 2008, from http://wordnet.princeton.edu/perl/webwn?s=entailment

Eurovoc Thesaurus. (2007). Retrieved June 28, 2008, from http://europa.eu/eurovoc/sg/sga_doc/eurovoc_dif!SERVEUR/menu!prod!MENU?langue=EN

Foo, S., Hui, S.C., Lim, H.K., & Hui, L. (2000). Automatic thesaurus for enhanced Chinese text retrieval. Library Review , 49(5), pp. 230-240.

Frakes, W.B. (1992a). Introduction to information storage and retrieval systems. In W.B. Frakes & R. Baeza-Yates (Eds.). Information Retrieval Data Structures & Algorithms (pp. 1-12). Englewood Cliffs, NJ: Prentice Hall.

Frakes, W.B. (1992b). Stemming algorithms. In W.B. Frakes & R. Baeza-Yates (Eds.). Information Retrieval Data Structures & Algorithms (pp. 131-160). Englewood Cliffs, NJ: Prentice Hall.

Fred Meyer Stores Inc. (2001). Retrieved February 25, 2008, from http://www.osha.gov/pls/imis/accidentsearch.accident_inspection?line_item=1&id=77890

Fuhr, N. & Buckley, C. (1991). A probabilistic learning approach for document indexing. ACM Transactions on Information Systems (TOIS ), 9(3), pp. 223-248.

Fun with Google's New Synonym Operator . (2003, August 24). Retrieved March 11, 2008, from http://www.waxy.org/archive/2003/08/04/fun_with.shtml

Furnas, G.W., Landauer, T.K., Gomez, L.M., & Dumais S.T. (1987). The vocabulary problem in human-system communication, Communications of the ACM , 30(11), pp. 964-971.

Geller, V.J. & Lesk, M.E. (1983). User interfaces to information systems: choices vs. commands. In Proceedings of the 6th annual international ACM SIGIR conference on Research and development in information retrieval , pp. 130-135.

Glossary of WordNet Terminology . (2006). Retrieved June 25, 2008, from http://wordnet.princeton.edu/gloss

Google searches more sites more quickly, delivering the most relevant results . (2008). Retrieved April 29, 2008, from http://www.google.com/technology/

Google SOAP Search API Reference . (2008), Retrieved May 1, 2008, from http://code.google.com/apis/soapsearch/reference.html

Greenberg, J. (2001). Optimal Query Expansion (QE) Processing Methods with Semantically Encoded Structured Thesauri Terminology. Journal of the American Society for Information Science and Technology , 52(6), pp. 487-498.

Grossman, D. (2001). Relevance Feedback. Retrieved May 1, 2008, from http://www.ir.iit.edu/~dagr/cs529/files/handouts/07Feedback.pdf

Grossman, D.A. & Frieder, O. (2004). Information Retrieval (2nd ed.). Dordrecht, The Netherlands: Springer.

Guthrie, J.A., Guthrie, L., Wilks, Y., & Aidinejad, H. (1991). Subject-dependent co-occurrence and word sense disambiguation. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pp. 146-152.

Haaland, D. E. (2006, April 17). Safety first: hire conscientious employees to cut down on costly workplace accidents. Nation's Restaurant News , Retrieved March 1, 2008, from http://findarticles.com/p/articles/mi_m3190/is_16_40/ai_n16130311

Hardmeier, S. (2006, April 27). Adapting to the brave new world of Internet Explorer 7 . Retrieved February 29, 2008, from http://www.microsoft.com/windows/ie/community/columns/adapting.mspx

Harman, D. (1992a). Ranking algorithms. In W.B. Frakes & R. Baeza-Yates (Eds.). Information Retrieval Data Structures & Algorithms (pp. 363-392). Englewood Cliffs, NJ: Prentice Hall.

Harman, D. (1992b). Relevance feedback and other query modification techniques. In W.B. Frakes & R. Baeza-Yates (Eds.). Information Retrieval Data Structures & Algorithms (pp. 241-263). Englewood Cliffs, NJ: Prentice Hall.

Hearst, M., Elliott, A., English, J., Sinha, R., Swearingen, K., & Yee, K.-P. (2002). Finding the flow in Web site search. Communications of the ACM, 45(9), pp. 42-49.

holonymy, WordNet. (2008). Retrieved July 14, 2008, from http://wordnet.princeton.edu/perl/webwn?s=holonymy

hypernymy, WordNet. (2008). Retrieved July 14, 2008, from http://wordnet.princeton.edu/perl/webwn?s=hypernymy

hyponymy, WordNet. (2008). Retrieved July 14, 2008, from http://wordnet.princeton.edu/perl/webwn?s=hyponymy

Imai, H., Collier, N., & Tsujii, J. (1999). A combined query expansion approach for information retrieval. Genome Informatics Series, 10, pp. 292-293.

Jansen, B.J. & Pooch, U. (2001). A review of Web searching studies and a framework for future research. Journal of the American Society for Information Science and Technology , 52(3), pp. 235–246.

Jing, Y. & Croft, W.B. (1994). An association thesaurus for information retrieval. In Proceedings of RIAO-94, 4th International Conference "Recherche d'Information Assistee par Ordinateur", pp. 146–160.

Korfhage, R.R. (1997). Information Storage and Retrieval. New York: John Wiley & Sons, Inc.

Larkin, P. (2000). Social security. Compensation for repetitive strain injury. Industrial Law Journal , 29(1), pp. 88-95

Liu, F., Yu, C., Meng, W., & Chowdhury, A. (2006). Effective keyword search in relational databases. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data , pp. 563-574. New York: ACM.

Logging Fatality Investigation Reports (n.d.). Retrieved March 1, 2008, from http://www.cdc.gov/Niosh/injury/traumalgface.html

Lorch, D. (1992, June 3). Hand Injuries In Workplace Ignite Battle. The New York Times. Retrieved April 15, 2008, from http://query.nytimes.com/gst/fullpage.html?res=9E0CE2DA173AF930A35755C0A964958260

Luhn, H.P. (1958). The Automatic Creation of Literature Abstracts, IBM Journal of Research and Development , 2(2), pp. 159-165.

Mandala, R., Tokunaga, T., & Tanaka, H. (2000). Query expansion using heterogeneous thesauri. Information Processing & Management , 36(3), pp. 361-378.

Manning, C.D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval [Electronic book]. Cambridge: Cambridge University Press. Retrieved April 23, 2008, from http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Marchionini, G. (1995). Information Seeking in Electronic Environments . Cambridge: Cambridge University Press.

Margulis, E.L. (1992). N-Poisson document modelling. In SIGIR '92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval , pp. 177-189.

Maron, M.E. & Kuhns, J.L. (1960). On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM , 7(3), pp. 216-244.

McDonald, D.M., & Chen, H. (2006). Summary in context: Searching versus browsing. ACM Transactions on Information Systems (TOIS), 24(1), pp. 111-141.

Meadow, C.T., Boyce, B.R., Kraft, D.H., & Barry, C. (2007). Text Information Retrieval Systems (3rd ed.). London, UK: Elsevier, Inc.

MEDLINE. (2008). Retrieved June 30, 2008, from http://www.nlm.nih.gov/pubs/factsheets/medline.html

meronymy, WordNet. (2008). Retrieved July 14, 2008, from http://wordnet.princeton.edu/perl/webwn?s=meronymy

Minker, J. (1977). Information storage and retrieval: a survey and functional description. ACM SIGIR Forum , 12(2), pp. 12-108.

Nielsen online reports topline U.S. data from April 2008 . (2008, April). Retrieved May 17, 2008, from http://www.nielsen-netratings.com/pr/pr_080515.pdf

Occupational Safety and Health Act of 1970: Public Law 91-596, 84 STAT. 1590, 91st Congress, S.2193, December 29, 1970, as amended through January 1, 2004. (2004). Retrieved February 25, 2008, from http://www.osha.gov/pls/oshaweb/owadisp.show_document?p_table=OSHACT&p_id=2743

Overview . (2007). Retrieved May 4, 2008, from http://trec.nist.gov/overview.html

OWL Web Ontology Language: Overview (2004), Retrieved May 20, 2008, from http://www.w3.org/TR/owl-features/

Papadimitriou, C., Raghavan, P., Tamaki, H., & Vempala, S. (1998). Latent semantic indexing: a probabilistic analysis. Journal of Computer and System Sciences, 61(2), pp. 217-235.

Polissar, L. & Diehr, P. (1982). Regression analysis in health services research: the use of dummy variables. Medical Care , 20(9), pp. 959-966.

Porter, M. (1980). An algorithm for suffix stripping, Program , 14(3), pp. 130-137.

Porter, M. (2006, Jan.). The Porter stemming algorithm , Retrieved April 27, 2009, from http://www.tartarus.org/~martin/PorterStemmer/

Prewitt, M. (2001, November 10). Hand injuries on the rise among kitchen workers: on-the-job cuts and burns cost industry an estimated $300m each year in medical fees, lost labor. Nation's Restaurant News, pp. 1, 83.

Rada, R. & Martin, B.K. (1987). Augmenting thesauri for information systems, ACM Transactions on Office Information Systems , 5(4), pp. 378-392.

Ranganathan, A. & Liu, Z. (2006). Information retrieval from relational databases using semantic queries. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 820-821. New York: ACM.

Results for antonym . (2008). Retrieved July 14, 2008 from http://www.answers.com/antonymy

Results for holonymy . (2008). Retrieved July 14, 2008 from http://www.answers.com/holonymy?cat=technology

Results for hypernym . (2008). Retrieved July 14, 2008 from http://www.answers.com/hypernym

Results for hyponym . (2008). Retrieved July 14, 2008 from http://www.answers.com/hyponym

Results for meronymy . (2008). Retrieved July 14, 2008 from http://www.answers.com/meronymy

Rieh, S.Y. & Xie, H.I. (2006). Analysis of multiple query reformulations on the web: The interactive information retrieval context. Information Processing & Management , 42(3), pp. 751–768.

Riekert, W. (2002). Automated retrieval of information in the Internet by using thesauri and gazetteers as knowledge sources. Journal of Universal Computer Science , 8(6), pp. 581-590.

Riordan, R.M. (2005). Designing Effective Database Systems . Upper Saddle River, NJ: Pearson Education Inc.

Robertson, S.E. & Sparck Jones, K. (1976). Relevance Weighting of Search Terms. Journal of the American Society for Information Science , 27(3), pp. 129-146.

Robins, D. (2000). Interactive Information Retrieval: Context and Basic Notions. Informing Science , 3(2), pp. 57-61.

Rubin, R.E. (2004). Foundations of Library and Information Science (2nd ed.). New York: Neal-Schuman Publishers, Inc.

Salton, G. & McGill, M.J. (1983). Introduction to Modern Information Retrieval . New York: McGraw-Hill.

Salton, G., Fox, E.A., & Wu, H. (1983). Extended Boolean information retrieval. Communications of the ACM, 26(11), pp. 1022-1036.

Selected Occupational Fatalities Related to Vehicle-Mounted Elevating and Rotating Work Platforms as Found in Reports of OSHA Fatality/Catastrophe Investigations. (1991). Retrieved February 25, 2008, from http://www.osha.gov/FatCat/fatcat.html

Shah, U., Finin, T., Joshi, A., Cost, R.S., & Mayfield, J. (2002). Information retrieval on the semantic web. In CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management , pp. 461-468.

Site Feature . (2008). Retrieved May 8, 2008, from http://help.ask.com/en/docs/about/site_features.shtml

Smith, P. (1995). Barriers to the Effective Search and Exploration of Large Document Databases. ACM SIGOIS Bulletin , 16(2), p. 39

Sparck Jones, K. (1992). Assumptions and issues in text-based retrieval. In P.S. Jacobs (Ed.). Text-based intelligent systems: current research and practice in information extraction and retrieval (pp. 157-177). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Sparck Jones, K. (2000). Further reflections on TREC. Information Processing and Management , 36, pp. 37-85.

Srinivasan, P. (1992). Thesaurus construction. In W.B. Frakes & R. Baeza-Yates (Eds.). Information Retrieval Data Structures & Algorithms (pp. 161-218). Englewood Cliffs, NJ: Prentice Hall.

Statistics & Data . (2008). Retrieved February 25, 2008, from http://www.osha.gov/oshstats/index.html

Struttmann, T.W. (2004, December 10). Fatal and Nonfatal Occupational Injuries Involving Wood Chippers --- United States, 1992—2002. Morbidity and Mortality Weekly Report , 53(48), 1130-1131. Retrieved March 1, 2008, from http://www.cdc.gov/mmwr/preview/mmwrhtml/mm5348a2.htm

Sugarhouse Barbecue. (2001). Retrieved February 25, 2008, from http://www.osha.gov/pls/imis/accidentsearch.accident_inspection?line_item=1&id=62166

Suits, D.B. (1984). Dummy variables: mechanics v. interpretation. The Review of Economics and Statistics , 66(1), pp. 177-180.

Switzerland . (n.d.). Retrieved March 1, 2008, from http://laborsta.ilo.org/applv8/data/SSM8/E/CH.html

Tanin, E., Shneiderman, B. & Xie, H. (2007). Browsing large online data tables using generalized query previews. Information Systems , 32(3), pp. 402-423.

The Essentials of Google Search. (2008). Retrieved March 11, 2008, from http://www.google.com/support/bin/static.py?page=searchguides.html&ctx=basics

TIPSTER Text Program: A multi-agency, multi-contractor program. (2001). Retrieved May 16, 2008, from http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/

tony.thompson. (2002, June 25). Re: Server returned HTTP response code: 403 for URL. Message posted to Developer Forums: Java Technologies for Web Services – Server returned HTTP response code: 403 for URL. Retrieved February 29, 2008, from http://forum.java.sun.com/thread.jspa?threadID=270107&messageID=1034378

troponymy, WordNet. (2008). Retrieved July 14, 2008, from http://wordnet.princeton.edu/perl/webwn?s=troponymy

Tudhope, D., Binding, C., Blocks, D., & Cunliffe, D. (2006). Query expansion via conceptual distance in thesaurus indexed collections. Journal of Documentation , 62(4), pp. 509-533.

Turtle, H.R. & Croft, W.B. (1991). Evaluation of an inference network-based retrieval model, ACM Transactions on Information Systems (TOIS), 9(3), pp. 187-222.

Understanding User-Agent Strings. (2008). Retrieved February 29, 2008, from Microsoft Developer Network Library, http://msdn2.microsoft.com/en-us/library/ms537503.aspx

van Eijsden-Besseling, M.D.F., Peeters, F.P.M.L., Reijnen, J.A.W., & de Bie, R.A. (2004). Perfectionism and coping strategies as risk factors for the development of non-specific work-related upper limb disorders (WRULD). Occupational Medicine, 54(2), pp. 122-127.

van Rijsbergen, C.J. (1979). Information Retrieval (2nd ed.) [Electronic book]. London: Butterworths. Retrieved April 20, 2008, from http://www.dcs.gla.ac.uk/Keith/Preface.html

Varelas, G. Voutsakis, E., Raftopoulou, P., Petrakis, E.G.M., & Milios, E.E. (2005). Semantic similarity methods in WordNet and their application to information retrieval on the web. In WIDM '05: Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management , pp. 10-16.

Weeds, J. & Weir, D. (2005). Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity. Computational Linguistics , 31(4), pp. 439-475.

What is Search Assist? (2008). Retrieved on May 3, 2008, from http://help.yahoo.com/l/us/yahoo/search/basics/basics-27.html

White, R.W. & Marchionini, G. (2006). A Study of Real-Time Query Expansion Effectiveness. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 715-716.

Wolfram, D. (2003). Applied Informetrics for Information Retrieval Research . Westport, CT: Libraries Unlimited, Inc.

Wolfram, D. (2008). Search characteristics in different types of Web-based IR environments: Are they the same? Information Processing and Management , 44(3), pp. 1279–1292.

Wong, S.K.M. & Yao, Y.Y. (1995). On modelling information retrieval with probabilistic inference. ACM Transactions on Information Systems (TOIS), 13(1), pp. 38-68.

Wu, Z., Yu, T., & Chen, H. (2008). Information retrieval and knowledge discovery on the semantic web of traditional Chinese medicine. In WWW '08: Proceeding of the 17th international conference on World Wide Web , pp. 1085-1086.

Xu, J. & Croft, W.B. (2000). Improving the Effectiveness of Information Retrieval with Local Context Analysis, ACM Transactions on Information Systems , 18(1), pp. 79-112.

Yu, H., Mine, T., & Amamiya, M. (2005). An architecture for personal semantic web information retrieval system. In WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web , pp. 974-975.


APPENDIX A

INVESTIGATION SUMMARY FORM (OSHA-170 FORM)


Source: OSHA, http://www.osha.gov/FatCat/fatcat.html


APPENDIX B

INVESTIGATION SUMMARY CODES

NATURE OF INJURY CODES

01 Amputation                    12 Fracture
02 Asphyxia                      13 Freezing/Frost Bite
03 Bruise/Contusion/Abrasion     14 Hearing Loss
04 Burn (Chemical)               15 Heat Exhaustion
05 Burn/Scald (Heat)             16 Hernia
06 Concussion                    17 Poisoning (Systemic)
07 Cut/Laceration                18 Puncture
08 Dermatitis                    19 Radiation Effects
09 Dislocation                   20 Strain/Sprain
10 Electric Shock                21 Other
11 Foreign Body in Eye           22 Cancer

PART OF BODY CODES

01 Abdomen                       17 Lower Arm(s)
02 Arm(s) Multiple               18 Lower Leg(s)
03 Back                          19 Multiple
04 Body System                   20 Neck
05 Chest                         21 Shoulder
06 Ear(s)                        22 Upper Arm(s)
07 Elbow(s)                      23 Upper Leg(s)
08 Eye(s)                        24 Wrist(s)
09 Face                          25 Blood
10 Finger(s)                     26 Kidney
11 Foot/Feet/Toe(s)/Ankle(s)     27 Liver
12 Hand(s)                       28 Lung
13 Head                          29 Nervous System
14 Hip(s)                        30 Reproductive System
15 Knee(s)                       31 Other Body System
16 Leg(s)

SOURCE OF INJURY CODES

01 Aircraft                            24 Hoisting Apparatus
02 Air Pressure                        25 Ladder
03 Animal/Insect/Bird/Reptile/Fish     26 Machine
04 Boat Equipment                      27 Materials Handling
05 Bodily Motion                       28 Metal Products
06 Boiler/Pressure Vessel              29 Motor Vehicle (Highway)
07 Boxes/Barrels, etc.                 30 Motor Vehicle (Industrial)
08 Buildings/Structures                31 Motorcycle
09 Chemical Liquids/Vapors             32 Windstorm/Lightning, etc.
10 Cleaning Compound                   33 Firearm
11 Cold (Environmental/Mechanical)     34 Person
12 Dirt/Sand/Stone                     35 Petroleum Products
13 Drugs/Alcohol                       36 Pump/Prime Mover
14 Dust/Particles/Chips                37 Radiation
15 Electrical Apparatus/Wiring         38 Train/Railroad Equipment
16 Fire/Smoke                          39 Vegetation
17 Food                                40 Waste Products
18 Furniture/Furnishings               41 Water
19 Gases                               42 Working Surface
20 Glass                               43 Other
21 Hand Tool (Powered)                 44 Fume
22 Hand Tool (Manual)                  45 Mists
23 Heat (Environmental/Mechanical)     46 Vibration
                                       47 Noise
                                       48 Biological Agent

EVENT TYPE CODES

01 Struck By                 09 Ingestion
02 Caught In or Between      10 Absorption
03 Bite/Sting/Scratch        11 Repeated Motion/Pressure
04 Fall (Same Level)         12 Cardio-Vascular/Respiratory System Failure
05 Fall (From Elevation)     13 Shock
06 Struck Against            14 Other
07 Rubbed/Abraded
08 Inhalation

ENVIRONMENTAL FACTOR CODES

01 Pinch Point Action
02 Catch Point/Puncture Action
03 Shear Point Action
04 Squeeze Point Action
05 Flying Object Action
06 Overhead Moving and/or Falling Object Action
07 Gas/Vapor/Mist/Fume/Smoke/Dust Condition
08 Materials Handling Equipment/Method
09 Chemical Action/Reaction Exposure
10 Flammable Liquid/Solid Exposure
11 Temperature Above or Below Tolerance Level
12 Radiation Condition
13 Working Surface/Facility Layout Condition
14 Illumination
15 Overpressure/Underpressure Condition
16 Sound Level
17 Weather/Earthquake, etc. Condition
18 Other

HUMAN FACTOR CODES

01 Misjudgement of Hazardous Situation
04 Malfunction of Procedure for Securing Operation or Warning of Hazardous Situation
05 Distracting Actions by Others
06 Equipment in Use Not Appropriate for Operation or Process
07 Malfunction of Neuro-Muscular System
08 Malfunction of Perception System with Respect to Task Environment
09 Safety Devices Removed or Inoperative
10 Operational Position Not Appropriate for Task
11 Procedure for Handling Materials Not Appropriate for Task
12 Defective Equipment: Knowingly Used
13 Malfunction of Procedure for Lock-Out or Tag-Out
14 Other
15 Insufficient or Lack of Housekeeping Program
16 Insufficient or Lack of Exposure or Biological Monitoring
17 Insufficient or Lack of Engineering Controls
18 Insufficient or Lack of Written Work Practices Program
19 Insufficient or Lack of Respiratory Protection
20 Insufficient or Lack of Protective Work Clothing and Equipment

Source: OSHA, http://www.osha.gov/FatCat/fatcat.html


APPENDIX C

QUESTIONNAIRES


Protocol # 2007E0453

CONSENT FOR PARTICIPATION IN RESEARCH

I consent to participating in (or my child's participation in) research entitled: Enhancing Information Retrieval and Display through the Use of Domain-Specific Knowledge.

Dr. Phil Smith, Principal Investigator, or his/her authorized representative Shih-Kwang Chen has explained the purpose of the study, the procedures to be followed, and the expected duration of my (my child’s) participation. Possible benefits of the study have been described, as have alternative procedures, if such procedures are applicable and available.

I acknowledge that I have had the opportunity to obtain additional information regarding the study and that any questions I have raised have been answered to my full satisfaction. Furthermore, I understand that I am (my child is) free to withdraw consent at any time and to discontinue participation in the study without prejudice to me (my child).

Finally, I acknowledge that I have read and fully understand the consent form. I sign it freely and voluntarily. A copy has been given to me.

Date: ______   Signed: ______ (Participant)

Signed: ______ (Principal Investigator or his/her authorized representative)

Signed: ______ (Person authorized to consent for participant, if required)

Witness: ______

HS-027E Consent for Participation in Exempt Research

Section I – General Demographics

In answering the following questions, please indicate the answers that best describe you.

1. How long have you been using the Internet (including e-mail, http, ftp, etc.)?

○ Less than 1 year ○ 1 to 3 years ○ 4 to 6 years ○ 7 to 10 years ○ 10 years or more

2. Where do you access the Internet?

□ From home (including a home office) □ From school □ From work □ From a public terminal (e.g., library, cybercafe, etc.) □ From other places

3. How often do you perform Internet searches?

○ Daily ○ Weekly ○ Monthly ○ Less than once a month ○ Never

4. Which search engines do you use for Internet searches and roughly how often relative to each other?

Search Engine                                Usage (%): 0  10  20  30  40  50  60  70  80  90  100
□ Google (http://www.google.com)                        ○  ○   ○   ○   ○   ○   ○   ○   ○   ○   ○
□ MSN Live Search (http://www.live.com)                 ○  ○   ○   ○   ○   ○   ○   ○   ○   ○   ○
□ Yahoo! (http://search.yahoo.com)                      ○  ○   ○   ○   ○   ○   ○   ○   ○   ○   ○
□ Ask.com (http://www.ask.com)                          ○  ○   ○   ○   ○   ○   ○   ○   ○   ○   ○
□ Others, please indicate: ______                       ○  ○   ○   ○   ○   ○   ○   ○   ○   ○   ○

5. Are you a college student?

○ No ○ Yes – major(s): ______

6. Please indicate the highest level of education completed.

○ Grammar School ○ High School or equivalent ○ Vocational/Technical School (2 year) ○ Some College ○ College Graduate (4 year) – Major(s): ○ Master's Degree (M.S., etc.) – Major(s): ○ Doctoral Degree (Ph.D.) or advanced Professional Degree (M.D., J.D., etc.) – Major(s) : ○ Other: ______

7. Gender

○ Female ○ Male

8. Age: ______

Section II – Demonstration of the Search Tool (Control Group only)

The Homepage of this search tool is a simple html page (Figure 1) as you would find in many other commercial search engines. It contains a textbox and a Search button next to the textbox. The textbox allows users to key in search queries. Users submit search queries by clicking on the Search button instead of using the “Enter” key.

Figure 1 Homepage of this search tool

Because this search tool uses an ActiveX control, Internet Explorer will show a pop-up box alerting users to the ActiveX control every time a button is clicked, a new html search result page is displayed, or the current html page is refreshed. Please click "Yes" to allow this ActiveX control interaction to continue. (See Figure 2)


Figure 2 A pop-up warning by Internet Explorer

After submitting the search query, the search tool retrieves results from Google and displays the retrieved documents in a newly generated html page – the Result Page (Figure 3).

Figure 3 A sample Result Page

The newly generated html page, the Result Page, is composed of four sections: Top, Left, Right, and Bottom.

The Top section contains the textbox and the Search button as in the homepage of this search tool. The textbox will display the most recent query to allow users to append or modify the query for a new search. Users may also simply enter a new query to start a new search.

The Left section contains the title “Search Query”. It displays the original search query and the number of retrievals provided in Google’s outcomes.

The Right section contains the title "Search Results". It displays the results retrieved from Google. The retrieved documents are numbered and displayed in this Right section on each Result Page.

The Bottom section provides links to additional retrieved documents.

For the purpose of this study, only the first one hundred documents from Google are retrieved. The first Result Page displays the information for the first forty documents. The second page, with the information for the next forty documents, can be accessed from link 2 at the Bottom section; likewise, link 3 leads to the third page, covering documents 81 to 100. (Figure 4)

The display of information for each document is similar to Google's display. The information includes a headline with a hyperlink to the actual document, a brief description of the document, and, where available, links to "Cached", "Similar pages", or "More results ..." provided by Google at the bottom of each description. In addition, an index number, which is not found in Google's display, is shown beside each document.


Figure 4 Links to more documents at the bottom of the page

Section II – Demonstration of the Search Tool (Treatment Group only)

The Homepage of this search tool is a simple html page (Figure 1) as you would find in many other commercial search engines. It contains a textbox and a Search button next to the textbox. The textbox allows users to key in search queries. Users submit search queries by clicking on the Search button instead of using the “Enter” key.

Figure 1 Homepage of the search tool

Because this search tool uses an ActiveX control, Internet Explorer will show a pop-up box alerting users to the ActiveX control every time a button is clicked, a new html page is displayed, or the current html page is refreshed. Please click "Yes" to allow this ActiveX control interaction to continue. (See Figure 2)


Figure 2 A pop-up warning by Internet Explorer

After submitting the search query, the search tool retrieves results from Google and displays the retrieved documents in a newly generated html page – the Result Page. (Figure 3)

Figure 3 A sample Result Page

The newly generated html page, the Result Page, is composed of four sections: Top, Left, Right, and Bottom.

The Top section contains the textbox and the Search button as in the homepage of this search tool. The textbox will display the most recent query to allow users to append or modify the query for a new search. Users may also simply enter a new query to start a new search.

The Left section contains the title “Search Query” and the following three segments: Original Search, Alternative Search, and Modified Search.

The Original Search segment displays the original search query and the number of retrievals reported in Google's outcomes. It keeps a record of the user's original query to reduce memory load. The Run Original Search button lets users revisit the search outcomes of the original query without typing it again.

The Alternative Search is similar to the Original Search, but it also searches for synonyms of the original query terms by adding a tilde sign ("~") in front of each term. It displays the Alternative Search query, with the tilde signs, and the number of retrievals for the Alternative Search. The Run Alternative Search button performs the search with this alternative query and displays the outcomes in the Right section, with the label "Alternative Search" shown at the top of the Right section when the Result Page is generated. The synonym search function is one of Google's Advanced Search functions.
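As a hedged illustration, the fragment below shows how such an alternative query could be formed and turned into a Google request. The endpoint and parameter names are illustrative only; the "~" operator was a documented Google feature at the time this study was conducted.

    # Forming an Alternative Search query and a Google request URL.
    # The /search endpoint and parameter names are illustrative only.
    from urllib.parse import urlencode

    def alternative_search_url(original_query, num_results=100):
        # Prefix each term with "~" so Google also matches its synonyms.
        tilde_query = " ".join("~" + term for term in original_query.split())
        return "http://www.google.com/search?" + urlencode(
            {"q": tilde_query, "num": num_results})

    print(alternative_search_url("hand injury workplace"))
    # http://www.google.com/search?q=~hand+~injury+~workplace&num=100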

Based on the knowledge of occupational injuries retrieved from OSHA's Accident Investigation Search results, terms related to occupational accidents are provided in the Suggested refinements for Alternative Search area. These terms are generated based on the user's original search query. Users may refine the original search by selecting suggested terms to broaden (Figure 4) or narrow (Figure 5) the search. Clicking the Show Modified Search button displays the modified query based on the selected refinements. (Figure 6)
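The exact syntax of the modified query that SAOA-1 generates is not shown here, so the following fragment is only one plausible way to assemble it from the user's selections: broadening terms are OR-grouped (Google supports OR natively), and narrowing terms are appended as required phrases.

    # One plausible assembly of a Modified Search query from checked
    # refinements; the syntax SAOA-1 actually generated may differ.
    def modified_query(original, broaden, narrow):
        query = original
        if broaden:  # broaden: accept any of the selected sibling terms
            query += " (" + " OR ".join(broaden) + ")"
        for term in narrow:  # narrow: require each selected term
            query += ' "%s"' % term
        return query

    print(modified_query("hand injury", broaden=["finger", "wrist"], narrow=["kitchen"]))
    # hand injury (finger OR wrist) "kitchen"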

209

Figure 4 Suggested refinements to broaden the search

210

Figure 5 Suggested refinements to narrow the search

211

Figure 6 A dialog box displays the modified search query based on the refinements

Clicking the Run Modified Search button will generate a new set of search outcomes based on the modified query the user has selected; this is the same query displayed by clicking the Show Modified Search button. The newly generated Result Page for the Modified Search will display the new modified query, along with the number of retrievals, above the Run Modified Search button. (Figure 7)

The Right section contains the title "Search Results" and displays the results retrieved from Google. The retrieved documents are numbered and displayed in this Right section on each Result Page. If the results were generated by the Original Search, Alternative Search, or Modified Search, the title of the Right section shows a highlighted "Original Search", "Alternative Search", or "Modified Search", respectively.

The Bottom section provides links to additional retrieved documents.

For the purpose of this study, only the first one hundred documents from Google are retrieved. The first Result Page displays the information for the first forty documents. The second page, with the information for the next forty documents, can be accessed from link 2 at the Bottom section; likewise, link 3 leads to the third page, covering documents 81 to 100. (Figure 8)

Figure 7 A sample Result Page of Modified Search


Figure 8 Links to more documents at the bottom of the page

The display of information for each document is similar to Google's display. The information includes a headline with a hyperlink to the actual document, a brief description of the document, and, where available, links to "Cached", "Similar pages", or "More results ..." provided by Google at the bottom of each description. In addition, an index number, which is not found in Google's display, is shown beside each document.

Section III – Search Process

Please take as long as you need to complete the following task.

Assume:

1. You are taking a college course dealing with occupational accidents.

2. You have been asked to write a term paper for this course based on some topic related to the following two documents. (See attachments)

1) An unusual injury of the hand in a plastic injection moulding machine Source: http://www.ijps.org/article.asp?issn=0970-0358;year=1995;volume=28;issue=1;spage=31;epage=32;aulast=Aiyer;type=0

2) Hand injuries on the rise among kitchen workers: on-the-job cuts and burns cost industry an estimated $300m each year in medical fees, lost labor Source: http://findarticles.com/p/articles/mi_m3190/is_47_39/ai_n15923107

3. Based on the content of these two documents, please write down the topic you would like to use for your term paper for this course on occupational accidents.

Topic:

4. To find information to help you write this paper, you should use the search site that I just showed you. This site uses the Google search engine to retrieve information. Please enter your search in the entry box labeled Search on the computer.

5. Look over your retrievals. If you think you need a better set of retrievals to find the information you need to write your term paper for the course on occupational accidents, you can refine your search by entering a new search.

6. If you think you’ve found the information you need to write your paper, you can stop.

7. Please indicate how relevant each document is to your search queries.

Retrieval     1 (Highly Irrelevant)     2     3     4     5 (Highly Relevant)     NA

1 ○ ○ ○ ○ ○ ○ 2 ○ ○ ○ ○ ○ ○ 3 ○ ○ ○ ○ ○ ○ 4 ○ ○ ○ ○ ○ ○ 5 ○ ○ ○ ○ ○ ○ 6 ○ ○ ○ ○ ○ ○ 7 ○ ○ ○ ○ ○ ○ 8 ○ ○ ○ ○ ○ ○ 9 ○ ○ ○ ○ ○ ○ 10 ○ ○ ○ ○ ○ ○

11 ○ ○ ○ ○ ○ ○ 12 ○ ○ ○ ○ ○ ○ 13 ○ ○ ○ ○ ○ ○ 14 ○ ○ ○ ○ ○ ○ 15 ○ ○ ○ ○ ○ ○ 16 ○ ○ ○ ○ ○ ○ 17 ○ ○ ○ ○ ○ ○ 18 ○ ○ ○ ○ ○ ○ 19 ○ ○ ○ ○ ○ ○ 20 ○ ○ ○ ○ ○ ○

21 ○ ○ ○ ○ ○ ○ 22 ○ ○ ○ ○ ○ ○ 23 ○ ○ ○ ○ ○ ○ 24 ○ ○ ○ ○ ○ ○ 25 ○ ○ ○ ○ ○ ○ 26 ○ ○ ○ ○ ○ ○ 27 ○ ○ ○ ○ ○ ○ 28 ○ ○ ○ ○ ○ ○ 29 ○ ○ ○ ○ ○ ○ 30 ○ ○ ○ ○ ○ ○

31 ○ ○ ○ ○ ○ ○ 32 ○ ○ ○ ○ ○ ○ 33 ○ ○ ○ ○ ○ ○ 34 ○ ○ ○ ○ ○ ○ 35 ○ ○ ○ ○ ○ ○ 36 ○ ○ ○ ○ ○ ○ 37 ○ ○ ○ ○ ○ ○ 38 ○ ○ ○ ○ ○ ○ 39 ○ ○ ○ ○ ○ ○ 40 ○ ○ ○ ○ ○ ○



Section IV – Usability of Interface (Control Group only)

Please rate the usefulness and usability of the system.
• Try to respond to all the items.
• For items that are not applicable, use: NA.

1. This search tool helped me to better refine my searches.
   strongly disagree   1 ○   2 ○   3 ○   4 ○   5 ○   strongly agree   NA ○

2. I was successful in finding the information I need to write my term paper.
   strongly disagree   1 ○   2 ○   3 ○   4 ○   5 ○   strongly agree   NA ○

3. Additional comments.

Section IV – Usability of Interface (Treatment Group only)

Please rate the usefulness and usability of the system.
• Try to respond to all the items.
• For items that are not applicable, use: NA.

1. It was easy to use this search tool.
   strongly disagree   1 ○   2 ○   3 ○   4 ○   5 ○   strongly agree   NA ○

2. This search tool helped me to better refine my searches.
   strongly disagree   1 ○   2 ○   3 ○   4 ○   5 ○   strongly agree   NA ○

3. This search tool helped me to better understand what I was retrieving.
   strongly disagree   1 ○   2 ○   3 ○   4 ○   5 ○   strongly agree   NA ○

4. I was successful in finding the information I need to write my term paper.
   strongly disagree   1 ○   2 ○   3 ○   4 ○   5 ○   strongly agree   NA ○

5. List the most negative aspect(s):

6. List the most positive aspect(s):

7. What aspects of this search tool are especially important?

8. Additional comments.

Receipt

I acknowledge that I have received a $20 cash incentive for my participation in this study evaluating the new search tool for information retrieval from the Internet.

Name:

Email address:

Phone number:

Signature Date


APPENDIX D

ANSWERS TO OPEN-ENDED QUESTIONS

5. List the most negative aspect(s):

Participant   The Most Negative Aspect(s)
001   I had to learn the search syntax if I wanted to manually change the keyword search with a term not already suggested by the engine.
008   It didn't have same operators "And" & "Or" that I was used to. It was limiting in that there weren't infinite check box possibilities.
010   Too simple, almost too dumb, a lot of simple text so easy to over look.
011   A couple of my top searches came up with information about insurance - completely unrelated.
012   There were some irrelevant retrievals such as legal or attorney websites.
014   Training my brain to look over to the left side of the screen.
015   It was different than what I'd used before.
016   When showing the modified search, it was in a format that was not great to read. Also some search engines have a relevance part tell you the relevance of a site. This one didn't.
017   The amount of searches that came up was extremely large.
018   ActiveX control pops out.
022   Not being able to use of and or in searches.
024   Reenter every key words.
025   In searching for "injuries" - the eye was the primary site affected that was mentioned.
026   I would like to search within my already found results.
027   None.
031   N/A.
033   Lack of graphic design.
036   My natural tendency to view details for the search is on the right side of the screen. The search query details were on the left side.
037   No enter button. Agent Smith should be more prominent (if it is the name of search engine).
038   Not enough description.
039   The articles retrieved don't have the whole title displayed. When using keyword to retrieve certain publication in NIOSH, it didn't link to the document directly.
040   None.
045   A little slow to generate the search results.
051   I think the side bar is a little confusing. (The participant explained that s/he uses Google and Google does not have a side bar.)
054   Some words appeared in phrases are considered as key words phrase used in search are separated.
055   Some of the narrowing search agents weren't relevant.
056   N/A.
057   Results that led me to click on other links - I like it to be as direct as possible.

059   Larger abstracts would be nice. (Probably would reduce need to expend time/energy invested in looking for more info about individual links.)
060   Couldn't use the "" tool that Google provides.

6. List the most positive aspect(s):

Participant   The Most Positive Aspect(s)
001   The suggested terms gave me new ideas of other related searches I wouldn't otherwise have thought of.
008   Check boxes were very nice. Lists number of (+) results under each search scenario.
010   No ads. It allows you to see what you searched for, so you don't wonder what the hell did I search for.
011   Easy to refine search.
012   Very easy to narrow search, saves time by picking keywords.
014   Drop down gives the ability to greatly refine search.
015   Easy to narrow down searches.
016   Design was simple. For injuries, the narrowing section was easy to use. Come back with a lot of articles that help to write the paper. Listed alternative searches.
017   Very easy to use and can refine searches very easily. Relevant websites came up for the most part.
018   Multiple search options.
022   The sentences under the topics pulled up allowed for fast identification if relevant in many cases.
024   See completed list of key words. (tool)
025   In searching with this search engine, I was able to change my search from having thousands of different resources to sift through, to just fewer thus saving time.
026   Easy to use - like automated refined search options.
027   The dropdown menu that had many choices.
031   Easy to use.
033   The alternative search function within the tool with addition of the combined control key words to the search words users typed in.
036   I really like how modifiable the searches are by category.
037   Refined searches on left side of page. Clear, clean page decorations. No ads or other distracting things on page.
038   On-hand visible search refinement tools.
039   This search tool did an excellent job in refining my searches. I was able to retrieve all documents that I need using one set of keywords (1 search only).
040   Clean, easy to read.
045   Accurate results. Good descriptions from each results.

051   I like that it offers the cached under each link.
054   Give more choice to refine searching.
055   The search agents that were relevant did help narrow the search. It returned a lot of relevant sites.
056   The search can be made wider.
057   Alternative search was most helpful - synonyms reworded what I was trying to search & produced just as helpful results as the original search.
059   It helps you refine your search terms both with synonyms & with further specification.
060   A list of possible search word synonym.

7. What aspects of this search tool are especially important?

Participant   The Aspect(s) Especially Important
001   It shows me the original search & subsequent alternative searches, so I have a quick history reminding me what I've already tried.
008   It made me think of how I could to narrow down my search topic.
010   It is simple but too simple may not be a good thing.
011   To be able to dive deeper into your search & to help you come up with the right "trigger" words.
012   The ability to pick multiple key words and to have the option to search what was typed or the more general form (~word).
014   Using synonyms.
015   It allows you to see how you've narrowed down your search and compare the results (in numbers of finds) to your previous search.
016   The ease of use and the results it brought back.
017   The ease of use.
018   Relevance to inquiries.
022   I like the "alternative search" option, but did not use it because I am not accustomed to it.
024   Added alternative key words for search.
025   It was nice to be able to narrow my search to achieve fewer resources to go through that were more relevant to topic.
026   The alternative recommendations.
027   Being able to further pick the categories, thus made it easy to find relevant information.
031   N/A.
033   Fast response time.
036   It's very easy to use and very easy to manipulate in a way that provides meaningful additional information on more than one search. The whole setup is quite easy to look at or aesthetically pleasing.
037   Refined search. Well numbered, well organized display of information (headers & descriptions).

038   Easily adjusting your search. You can try to limit it in one or more areas easily to find a good combination.
039   Ease of use. Relevance of documents retrieved.
040   Finding related sources.
045   Accuracy.
051   The search bar & cached link.
054   Suggested refinements.
055   It gives good sources like PubMed.
056   Useful.
057   Important that the results are from reliable sources (like scientific journals and not Wikipedia).
059   Teaches users that different search terms bring different results (for better or worse). Many casual search engine users don't realize this.
060   Alternative search.

8. Additional comments

8.1 Additional comments from the Treatment Group only (8 comments)

Participant   Additional Comments
010   The important articles are mixed with the not so important. I think if it some how would divide the list even for that and show "really important" and the "not so important".
016   I like the name: Agent Smith.
022   I like having 40 results on one page. In Google, I have to flip through many more pages to get 40 (at least it seems that way).
026   I think it has a lot of potential!
027   I think one of the best parts of the search engine is that it takes into account all the words to find the original list and then let you narrow your search even more.
037   I definitely liked it and would use it to research on my own.
040   More options on search based on which sites we would like to retrieve data. (ex: ....com, ....gov, ....edu)
059   This would be a nice research tool/lesson for my English 110 students!

8.2 Additional comments from the Control Group only (8 comments)

Participant   Additional Comments
003   Like that the system is simple and easy to use. Like that it only shows you 100 results because most people don't go past the 2nd page anyway. Would have liked to have used the enter button to search rather than using the mouse.
013   User interface needs improvement!! Looks very boring at the moment. Activate ENTER key to start the search.
019   Good job Agent Phil!!! :)
030   Description could have been used a little big fonts for easy read.
035   Surprised at how much relevant info it brought up instead of ads.
042   The first couple matches did not seem relevant to my paper, but the subsequent matches in the first 20 matches were better.
046   It worked ok, but did not seem to put the items in order of relevance like some search engines do.
047   The short description of what the site contained was helpful.
