Analysis of Query Keywords of Sports-Related Queries Using Visualization and Clustering
Jin Zhang and Dietmar Wolfram
School of Information Studies, University of Wisconsin–Milwaukee, Milwaukee, WI 53201. E-mail: {jzhang, dwolfram}@uwm.edu

Peiling Wang
School of Information Sciences, College of Communication and Information, University of Tennessee at Knoxville, Knoxville, TN 37996–0341. E-mail: [email protected]

The authors investigated 11 sports-related query keywords extracted from a public search engine query log to better understand sports-related information seeking on the Internet. After the query log contents were cleaned and query data were parsed, popular sports-related keywords were identified, along with frequently co-occurring query terms associated with the identified keywords. Relationships among each sports-related focus keyword and its related keywords were characterized and grouped using multidimensional scaling (MDS) in combination with traditional hierarchical clustering methods. The two approaches were synthesized in a visual context by highlighting the results of the hierarchical clustering analysis in the visual MDS configuration. Important events, people, subjects, merchandise, and so on related to a sport were illustrated, and relationships among the sports were analyzed. A small-scale comparative study of sports searches with and without term assistance was conducted. Searches that used search term assistance by relying on previous query term relationships outperformed the searches without the search term assistance. The findings of this study provide insights into sports information seeking behavior on the Internet. The developed method also may be applied to other query log subject areas.

Received May 21, 2008; revised March 15, 2009; accepted March 17, 2009
© 2009 ASIS&T • Published online 13 May 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.21098
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 60(8):1550–1571, 2009

Introduction and Previous Research

Transaction log data analysis has been widely used over the past few decades to better understand user searching. The reasons for this are clear. Large quantities of search information are faithfully recorded in a transaction log. It is, therefore, natural to tap into transaction logs to gain insights into users' search behavior. A log file is able to record the history of all online users' requests. It accurately keeps and maintains all online users' activities performed on a server. Usually, the user's request includes the client Internet Protocol (IP) address, request date/time, page requested, HTTP code, bytes served, user agent, referrer, and so on. The data are kept in a standard format in a transaction log file (Hallam-Baker & Behlendorf, 2008).

Although a transaction log comprises rich data, including browsing times and traversal paths, it is the queries directly submitted by users that have attracted the most research attention. Query data contain keywords that reflect users' wide-ranging information needs. Query logs have been analyzed from a variety of sources with different emphases and audiences. Studies of query characteristics have included special-topic Web sites such as THOMAS (Croft, Cook, & Wilder, 1995), a local search engine (Park, Lee, & Bae, 2005), a university Web site (Wang, Berry, & Yang, 2003), digital libraries (Jones, Cunningham, & McNab, 1998; Mahoui & Cunningham, 2000), a Web-based, online public-access catalog (Cooper, 2001), and bibliographic databases (Yi, Beheshti, Cole, Leide, & Large, 2006) as well as public search engines such as Fireball (Hoelscher, 1998), AltaVista (Jansen, Jansen, & Spink, 2005), Excite (Jansen, Goodrum, & Spink, 2000; Rieh & Xie, 2006; Silverstein, Marais, Henzinger, & Moricz, 1999; Spink, Jansen, Wolfram, & Saracevic, 2002; Spink, Wolfram, Jansen, & Saracevic, 2001), and Vivisimo (Koshman, Spink, & Jansen, 2006) and federated search systems such as Dogpile (Spink, Jansen, & Koshman, 2007). Similarly, a number of studies have focused on specific search areas or circumstances. These have included search queries on mobile phones in Japan (Baeza-Yates, Dupret, & Velasco, 2007), employment-based searching (Jansen et al., 2005), multimedia searching on a public search engine (Goodrum & Spink, 2001; Jansen et al., 2000), sexual topics (Spink, Ozmutlu, & Lorence, 2004), and health or medical-related information (Spink, Yang, et al., 2004).

Different approaches have been applied to the analysis and reporting of query data. Early studies such as those by Croft et al. (1995), Hoelscher (1998), and Silverstein et al. (1999) have provided descriptive analyses of keywords, queries, and possibly, sessions. Ross and Wolfram (2000) used hierarchical cluster analysis on query-keyword pairs for multi-keyword queries for a transaction log from the Excite search engine to identify groups of topics. Beitzel, Jensen, Chowdhury, Frieder, and Grossman (2007) argued that categorization and classification of user queries can lead to increased effectiveness and efficiency in general-purpose Web search systems. They investigated properties of a very large query log over varying periods and were able to identify and examine topical trends. Shi and Yang (2007) developed a method to assist users in formulating initial queries using association rules to identify related queries from a Web log. In a similar study, Huang, Chien, and Oyang (2003) were able to suggest keywords for assisting user interactive searches. Relevant keywords suggested for a user query came from those that co-occurred in similar query sessions in a transaction log. Beeferman and Berger (2000) applied an agglomerative clustering algorithm to search queries and relevant Web pages in a transaction log to discover potential query clusters. Likewise, Wen, Nie, and Zhang (2001) studied the relationship between a search query and selected Web pages in conjunction with query contents to categorize queries in a transaction log. To visually analyze Web logs, Whittle, Eaglestone, Ford, Gillet, and Madden (2007) looked for similarities between queries and identified sequences of "query transformations," which were represented as graphical networks, thereby providing a different view of search behavior. To facilitate transaction log analysis, Joshi, Joshi, Yesha, and Krishnapuram (1999) developed an ad hoc tool for analytic queries on a transaction log warehouse.

Sports-related information-seeking behavior on the Internet has attracted the attention of several researchers. Ernest, Level, and Culbertson (2005) concluded that the Internet was a preferred source for sports and other consumer information. Sports topics were identified as one of the most popular online Internet search-topic categories by Kelly (2006). The role that the Internet plays for sports fans is a direct extension of their interest in sports. People who like to watch sports are more likely to purchase event tickets and visit sports-related Web sites (SportsBusiness Daily, 2001). Due to the nature of sports, many studies of the sports-related [...] were observed and analyzed, and their information search behavior patterns were generalized. Based on these studies, it is clear that although several studies have examined sports-related information on the Internet, few have focused on sports-related information-seeking behavior.

Existing studies have demonstrated the complexity of relationships in how vocabularies are used to formulate queries. Clustering or grouping techniques lend themselves to newer exploratory methods such as information visualization. Information visualization was introduced to utilize the human perceptional capacity to present, understand, and explore complex abstract information by using computing techniques (Robertson, Card, & Mackinlay, 1989). It is widely recognized that the human perception system is responsible for not only receiving outside information but also processing received information. The way information is processed by the human perception system is unique, effective, and efficient. According to Zeki (1992), the four parallel systems within the human visual cortex work simultaneously to process received visual inputs from the retina.

Information visualization is employed to reveal connections and relationships among investigated objects. It is believed that information visualization can be used to support tasks such as data analysis, information exploration, information explanation, trend prediction, and pattern detection (Zhang, 2008).

A traditional clustering analysis usually offers a set of separate and disconnected clusters. A visual-analysis method provides users with proximity characteristics of objects that can be visually grouped into clusters; the connections between an object and multiple relevant objects in a cluster; a holistic overview of all involved objects and clusters; and the contexts and degree to which clusters and objects are connected and related. In addition, a visual display of clusters is vivid, straightforward, and intuitive.

There are many available information-visualization techniques and applications such as Pathfinder associative networks (Fowler, Fowler, & Wilson, 1991; Schvaneveldt, Durso, & Dearholt, 1989), which are appropriate for visualizing a simplified and optimized network for a sophisticated network; self-organizing maps (SOMs; Kohonen, 2001; Kohonen et al., 2000), which categorize objects by presenting them in a semantic map;