A Comparison of Methods for Collecting Web Citation Data for Academic Organisations
Mike Thelwall
Statistical Cybermetrics Research Group, School of Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK. E-mail: [email protected] Tel: +44 1902 321470 Fax: +44 1902 321478

Pardeep Sud
Statistical Cybermetrics Research Group, School of Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, UK. E-mail: [email protected] Tel: +44 1902 328549 Fax: +44 1902 321478

Abstract

The primary webometric method for estimating the online impact of an organisation is to count links to its web site. Link counts have been available from commercial search engines for over a decade but this was set to end by early 2012, so a replacement is needed. This article compares link counts with two alternative methods: URL citations and organisation title mentions. New variations of these methods are also introduced. The three methods are compared against each other using Yahoo. Two of the three methods (URL citations and organisation title mentions) are also compared against each other using Bing. Evidence from a case study of 131 UK universities and 49 US Library and Information Science (LIS) departments suggests that Bing's Hit Count Estimates (HCEs) for popular title searches are not useful for webometric research but that Yahoo's HCEs for all three types of search and Bing's URL citation HCEs seem to be consistent. For exact URL counts the results of all three methods in Yahoo and both methods in Bing are also consistent. Four types of accuracy factor are also introduced and defined: search engine coverage, search engine retrieval variation, search engine retrieval anomalies, and query polysemy.
Introduction

One of the main webometric techniques is impact analysis: using web-based methods to estimate the online impact of documents (Kousha & Thelwall, 2007b; Kousha, Thelwall, & Rezaie, 2010; Vaughan & Shaw, 2005), journals (Smith, 1999; Vaughan & Shaw, 2003), digital repositories (Zuccala, Thelwall, Oppenheim, & Dhiensa, 2007), researchers (Barjak, Li, & Thelwall, 2007; Cronin & Shaw, 2002), research groups (Barjak & Thelwall, 2008), departments (Chen, Newman, Newman, & Rada, 1998; Li, Thelwall, Wilkinson, & Musgrove, 2005b), universities (Aguillo, 2009; Aguillo, Granadino, Ortega, & Prieto, 2006; Qiu, Chen, & Wang, 2004; Smith, 1999) and even countries (Ingwersen, 1998; Thelwall & Zuccala, 2008). The original and most widespread approach is to count hyperlinks to the object studied, normally using an advanced web search engine query. This search facility was apparently due to cease by 2012, however, since Yahoo!, the only major search engine to offer a link search service, was taken over by Microsoft (BBC, 2009), which had previously closed down the link search capability of its own search engine Bing (Seidman, 2007). More specifically, "Yahoo! has transitioned to Microsoft-powered results in the U.S. and Canada; additional markets will follow throughout 2011. All global customers and partners are expected to be transitioned by early 2012" (Yahoo, 2011a). This transition apparently includes phasing out link search because the linkdomain command stopped working before April 2011 in the US and Canadian version of Yahoo. For instance, the query linkdomain:wlv.ac.uk -site:wlv.ac.uk returned correct results in the UK version of Yahoo (April 13, 2011) but only webometrics pages containing the query term "linkdomain:wlv.ac.uk" in the US/Canada version (April 13, 2011). Moreover, Yahoo stopped supporting automatic searches, as normally used in webometrics, in April 2011 (Yahoo, 2011b), leaving no remaining automatic source of link data from search engines. Hence it is important to develop and assess online impact estimation methods as replacements for link searches.

Several alternative online impact assessment methods have already been developed and deployed in various contexts. Link counts have been estimated using web crawlers rather than search engines (Cothey, 2004; Thelwall & Harries, 2004). A limitation of crawlers is that they can only be
used on a modest scale because of the time and computer resources needed. In particular, this approach cannot be used to estimate online impact from the whole web, which is the objective of most studies. Other projects have used two different types of search to estimate online impact: web mentions and URL citations. A web mention (Cronin, Snyder, Rosenbaum, Martinson, & Callahan, 1998) is a mention in a web page of the object being investigated, such as a person (Cronin et al., 1998) or journal article (Vaughan & Shaw, 2003). Web mention counts are estimated by submitting appropriate queries to a search engine. For instance, the query might be a person's name as a phrase search (e.g., "Eugene Garfield"). The drawback of web mention searches is that they are often not unique and therefore give some spurious matches (e.g., to Eugene Garfields other than the famous bibliometrician in the above case). Another drawback is that there may be multiple equivalent descriptions (e.g., "Gene Garfield"), which complicates the analysis. Finally, a URL citation is the mention of the URL of a web page or web site in another web page, whether accompanied by a hyperlink or not. URL citation counts can be estimated by submitting URLs as phrase searches to search engines. The principal disadvantage of URL citation counts is conceptual: including URLs in the visible text of web pages seems to be unnatural and so it is not clear that they are a reasonable source of online impact evidence, except perhaps in special cases like articles (see below).

This article uses (web) organisation title mentions as a metric for organisations. This is really just new terminology for an existing measurement (web mentions) that has not been used in the context of organisations for impact measurements before, but has been used as part of a "Word co-occurrences" indicator of the similarity of two organisations (Vaughan & You, 2010), as described below. This study uses a set of UK universities and a set of US LIS departments to compare link counts, organisation title mention counts and URL citation counts against each other to assess the extent to which they have comparable outputs. It compares the results between Bing and Yahoo to assess the consistency of these search engines. It also introduces some methodological innovations. Finally, all the methods and innovations are incorporated into the free Webometric Analyst software (lexiurl.wlv.ac.uk, formerly known as LexiURL Searcher) to make them easily available to the webometric community.
Organisation title mentions as a Web impact measurement

The concept of impact and methods for measuring it are central to this article, but both are subject to disagreement and opposition (see e.g., MacRoberts & MacRoberts, 1996; Seglen, 1998). The term impact seems to be typically used in bibliometrics either as a general term, almost synonymous with importance or influence, or as a technical term, synonymous with citation counts, as in the phrases citation impact or normalised citation impact (Moed, 2005, p. 37, p. 221). Moed recommends using citation impact as a way to resolve the issue because it suggests the methodology used to evaluate impact (Moed, 2005, p. 221). In this way it could be read as a general or technical description. There seems to be a belief amongst bibliometricians that citation counts, if appropriately normalised, can aid or replace peer judgements of quality for researchers (Bornmann & Daniel, 2006; Gingras & Wallace, 2010; Meho & Sonnenwald, 2000), journals (Garfield, 2005), and departments (Oppenheim, 1995, 1997; Smith & Eysenck, 2002; van Raan, 2000), but that they do not directly measure quality because some high quality work attracts few citations and some poor work attracts many. Instead citations are indicators, essentially meaning that they are sometimes wrong. Moreover, in some contexts indicators may be of limited practical use. For example, journal citations may not help much when evaluating individual arts and humanities researchers in book-based disciplines.

The idea behind citation analysis is that science is a cumulative enterprise and that in order to contribute to the overall advancement of science, research has to be used by others to help with new discoveries, and the prior work is acknowledged by citations (Merton, 1973). But research can also have an impact on society, for example in the form of commercial exploitation, informing government policy or informing/entertaining the public. Hence it could be argued that instead of counting just the citations to published articles, it might also be helpful to count the number of times a researcher or institution is mentioned - for example in the press or on the web (e.g., Cronin, 2001; Cronin & Shaw, 2002; Cronin et al., 1998). More generally, the "range of genres of invocation made possible by the Web should help give substance to modes of influence which have historically been backgrounded" (Cronin et al., 1998). As with motivations for citing (Bornmann & Daniel, 2008), there may be
negative and irrelevant mentions (for example, on the web) but this does not prevent counts of mentions from being useful or correlating with other measures of value.

Link analysis for universities, although motivated by citation analysis, is not similar in the sense that hyperlinks to universities rarely directly target their research documents (e.g., published papers) but typically have more general targets, such as the university itself, via a link to a home page (Thelwall, 2002b). Nevertheless, links to universities correlate strongly with their research productivity (e.g., Spearman's rho 0.925 for UK universities: Thelwall, 2002a). Hence links to universities can be used as indicators for research, but what they measure is probably a range of things, such as the extent of their web publishing, their size, their fame, the fame of their researchers, professional activities and their contribution to education (see e.g., Bar-Ilan, 2005; Wilkinson, Harries, Thelwall, & Price, 2003). Hence it seems reasonable to assess alternatives to inlinks: other quantities that are measurable and may reflect a wide range of factors related to the wider impact or fame of academic organisations.

Two logical alternatives to hyperlinks are URL citations and title mentions. A URL citation is similar to a link except that the URL is visible in a web page rather than existing as a clickable link. It seems that moving from link counting to URL citation counting is a small conceptual step. An organisation title mention (or invocation or web citation) is the inclusion of the name of an organisation in a web page without necessarily linking to the organisation's web site or including its URL in the page. From a theoretical perspective, this is possibly a weaker indicator of endorsement because links and URL citations contain navigation information, but titles do not. Nevertheless, a title mention seems to be otherwise similar to URL citations and links in the sense that all are invocations of the target organisation. As with mentions of individual researchers, organisation title mentions could capture types of influence that would not be reflected by traditional citation counts.

There is a practical problem with organisation title mentions: whilst URLs are unambiguous and URL citations are almost unambiguous, titles are not (Vaughan & You, 2010). Therefore title-based web searches may in some cases generate spurious matches. For example, the phrase "Oxford University" also matches "Oxford University Press", but www.ox.ac.uk is unique to the university, even if this domain may host some external web sites.
Literature review

This section discusses evidence for the strengths and weaknesses of the three methods discussed above. Most of the studies reviewed analyse a particular type of web site or web page and so their findings are typically limited in scope from the perspective of the current article.
Link counts

Counts of links to web sites formed the original type of online impact evidence (Ingwersen, 1998). A variant of link counting, Google's PageRank (Brin & Page, 1998), has been a high-profile endorsement of the idea that links indicate impact, but a number of articles have also explicitly tested this. In particular, counts of inlinks (i.e., hyperlinks originating in other web sites, sometimes called site inlinks (Björneborn & Ingwersen, 2004)) to university web sites correlate with research productivity for the UK (Thelwall & Harries, 2004), Australia (Thelwall, 2004) and New Zealand (Thelwall, 2004), and the same is true for some disciplines in the UK (Li, Thelwall, Wilkinson, & Musgrove, 2005a). More directly, counts of links to journal web sites correlate with journal Impact Factors (An & Qiu, 2004; Vaughan & Hysen, 2002) for homogeneous journal sets, but numbers of links pointing to a research web site are not a good indicator of researcher productivity (Barjak et al., 2007). In summary, there is good evidence that in academic contexts inlink counts are reasonable indicators of academic impact at larger units of aggregation than that of the individual researcher. However, in the case of university web sites the correlation between inlinks and research productivity arises because more productive researchers produce more web content rather than because they produce better web content (Thelwall & Harries, 2004). Outside of academic contexts, the concept of "impact" is much less clear, but links to commercial web sites have been shown to correlate with business performance measures (Vaughan, 2005; Vaughan & Wu, 2004), indicating that link counts are at least related to properties of the web site owner.
Web mention counts

Web mentions, i.e., the invocation of the name of a person or object, were introduced to identify web pages invoking academics as a way of seeking wider evidence of the impact, value or fame of individual academics (Cronin et al., 1998). Many of the web pages found in that study did indeed give evidence of the wider academic activities of the scholars (e.g., conference attendance). A web mention is a textual mention in a web page, typically of a document title or person's name. Nevertheless, a web mention encompasses any non-URL textual description. Web mentions can be found by normal search engine queries. Typically a phrase search may be used, but additional terms may be added to reduce spurious matches. Web mentions were first extensively tested for journal articles (Vaughan & Shaw, 2003). Using searches for article titles and (if necessary to avoid false matches) subtitles and author names, web mentions (called web citations in the article) correlate with Social Sciences Citation Index citations, partially validating web mentions as an academic impact indicator (see also: Vaughan & Shaw, 2005). Web mentions have also been applied to identify online citations of journal articles in more specialised contexts: online presentations (Thelwall & Kousha, 2008), online course syllabuses (Kousha & Thelwall, 2008) and (Google) books (Kousha & Thelwall, 2009). In each case a significant correlation was found between Web of Science citation counts and web mentions for individual articles. Hence there is good evidence that web mentions work well as impact indicators for academic articles.

Web mentions have also been assessed as a similarity metric for organisations. Vaughan and You (2010) compiled a list of 50 top WiMax or Long Term Evolution telecommunications companies to identify patterns of similarity by finding how often pairs of companies were mentioned on the same page. Company mentions were assessed through the company acronym, name or a short version of its name. Five companies had to be excluded due to ambiguous names (e.g., names also used by other, larger organisations). For each pair of companies, Google and Google Blogs searches were used to count how many pages mentioned both, and the results were used to produce a multi-dimensional scaling diagram of the entire set. The results were compared with diagrams created by Yahoo co-inlink searches for the same set of companies. The patterns produced by all methods were consistent and matched the known industry sectors of the companies, giving evidence that counting co-mentions of company names was a valid alternative to counting co-inlinks for similarity analyses. Moreover, there was a suggestion that Google Blogs could be more useful than the main Google search, perhaps due to blogs containing less shallow material (Vaughan & You, 2010). This study used a similar approach to the current paper, in the sense of comparing the results of title-based and link-based data, except that it counted co-mentions instead of direct mentions, only used one query for each organisation, did not directly compare the results for individual organisations (e.g., via a rank order test), did not use URL citations, and used Yahoo, Google and Google Blogs instead of Bing and Yahoo for the searches.
URL citation counts

URL citations are identified with search engine queries using a full or partial URL as a phrase search (e.g., the query "www.wlv.ac.uk"). These have only been evaluated for collections of journal articles, also with positive results (Kousha & Thelwall, 2006, 2007a). Hence, in this limited context, URL citations seem to be a reasonable indicator of academic impact. URL citations have also previously been used in a business context to identify connections between pairs of organisations. To achieve this, Google was used to identify, for each pair of organisations (A, B), the number of web pages in A that contained a URL citation to B. The results were used to construct network diagrams of the relationships between organisations (Stuart & Thelwall, 2006).
Comparisons of citation count methods

Since there is evidence that all three measures may indicate online impact it would be reasonable to compare them against each other. One study used Yahoo to compare URL citations with inlinks for 15 web site data sets from previous webometric projects. The results showed a significant positive
correlation between inlinks and URL citations in all cases, with the Spearman coefficient varying from 0.436 (URLs of sites linking to an online magazine) to 0.927 (European universities). It was found that URL citations were typically less numerous than inlinks outside of academic contexts and were much less numerous when path information (i.e., sections of the URL after the domain name, such as www.bt.com/contact/ rather than www.bt.com) was included in the URLs (Thelwall, 2011, in press). Hence, URL citations seem to be particularly numerous for universities (which are academic and have their own domain name) but may be problematic for sets of smaller, non-academic web sites if some are likely not to have their own domain name. This is consistent with prior research indicating that URL citations were relatively rare in commercial contexts (Stuart & Thelwall, 2006). As the above review indicates, there is a lack of research evaluating web mentions and URL citations for data other than journal article collections and there is a lack of comparative research on other types of data, except for inlinks compared to URL citations.
Search engines in Webometric research

Commercial search engines are a common data source for webometric research, although some studies use personal web crawlers instead. In consequence, the accuracy and consistency of search engine results has also formed a part of webometrics. Two different types of search engine output can be used: complete lists of results, and hit count estimates (HCEs). Complete lists of results can be used if the total number of results is less than the maximum normally returned by search engines, which is 1,000. The total number of matching pages obtained can be extended by using the query splitting technique (Thelwall, 2008a), but this may not work well if there are more than 100,000 results, and it is slow. HCEs are the alternative choice if there are too many queries to get exact page counts for or if there are too many results per query. HCEs are the estimates at the top of search results pages of the total number of matching pages. It takes only one search per query to obtain an HCE, but a full list of 1,000 pages takes 20 searches because the matching pages are delivered in sets of 50. Search engine application programming interfaces (APIs), as used by webometrics software to automatically submit queries, are rate-limited to 5,000 (Yahoo, before it closed in April 2011) or 10,000 (Bing) queries per day; for large webometric exercises the up to 19 additional queries per search can make retrieving full sets of results impractical, so HCEs are used instead. This is particularly relevant when queries are used to get data for network diagrams because one search is needed per pair of web sites, so the number of queries needed scales with the square of the number of sites involved (see the sketch at the end of this section).

In terms of the accuracy of search engine results: it has long been known that search engines do not give comprehensive results since they do not index the whole web (Lawrence & Giles, 1999) and results may fluctuate unpredictably over time (Bar-Ilan, 1999, 2001; Lewandowski, Wahlig, & Meyer-Bautor, 2006; Rousseau, 1999). Moreover, they do not always return complete lists of matches from their own index (Bar-Ilan & Peritz, 2008; Mettrop & Nieuwenhuysen, 2001) due, amongst other factors, to spam removal and restrictions on the number of matches returned from the same web site (Gomes & Smith, 2003). Moreover, HCEs can vary between results pages for the same query, which seems to be due to filtering unwanted pages from the results in stages rather than at the time of submission of the original query (Thelwall, 2008b). Result stability may also vary in unpredictable ways for different types of query (Uyar, 2009a, 2009b). In summary, search engine results should be treated with caution, the HCEs and total number of results returned should normally be underestimates of the total number of matching pages on the web, and there is no reason to believe that different search engines would give similar results for the same queries.
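To make the query-budget arithmetic above concrete, the following is a minimal Python sketch based on the figures quoted in this section; the function names and the 131-site example are illustrative and not part of any existing webometric tool.

```python
import math


def queries_for_full_results(estimated_results, per_page=50, max_results=1000):
    """API queries needed to retrieve a full result list: matches arrive
    in sets of 50, up to the normal search engine ceiling of 1,000."""
    return math.ceil(min(estimated_results, max_results) / per_page)


def hce_queries(n_sites, pairwise=False):
    """Queries needed when only HCEs are collected: one per target, or
    one per pair of sites for a network diagram (growing with the square
    of the number of sites)."""
    return n_sites * (n_sites - 1) // 2 if pairwise else n_sites


print(queries_for_full_results(1000))   # 20 searches for one full list
print(hce_queries(131))                 # 131 HCE queries
print(hce_queries(131, pairwise=True))  # 8,515 pair queries for a network
```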
Research questions

The objectives of this study are to compare all of the main web citation methods for assessing the impact of academic organisations. Are URL citation counts, organisation title mention counts and inlink counts approximately equivalent in Bing (excluding inlink counts) and in Yahoo? Are URL citation HCEs, organisation title mention HCEs and inlink HCEs approximately equivalent in Bing (excluding inlink HCEs) and in Yahoo?
Do Yahoo and Bing give approximately equivalent results for the same organisation title mention queries and for the same URL citation queries?

In the above, the search engine Yahoo was chosen as the only major search engine still (in January 2011) allowing link searches. Bing was chosen because, like Yahoo, it is a major search engine and allows automatic query submission using an API. Google was not used because its search API (Google, 2011) only allows searching of a user-specified list of web sites, which is inadequate for the current study and for most webometric studies. Although the Google web interface can be used with manual human submission of queries, this greatly adds to the time needed to conduct a study and would also require copying the Google results for automatic processing for query splitting (see below) and duplicate elimination, a task that does not seem to have been attempted previously in any webometric study. Note also that at the time of data collection (January-February 2011) Bing and Yahoo were both owned by Microsoft, so the comparison between them is not ideal. Nevertheless, the results below show that they perform far from identically in the tests.
Methods

This research uses two academic case studies: large sites for HCEs and small sites for counts of URLs. The large web sites data set is a collection of 131 UK universities, with names and URLs taken from http://cybermetrics.wlv.ac.uk/database/university_lists.htm (December 23, 2010) and subsequently checked. This gives a data set of relatively large academic web sites. The small web sites data set is a collection of US library and information science departments, with names and URLs taken from the 2009 US News rankings at http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-library-information-science-programs/library-information-science-rankings (February 10, 2011). Both of these choices are relatively arbitrary but should give insights into the answers to the research questions.

For the URL citation searches, the domain name (universities) or full URL (departments) was placed in quotes, after excluding the initial http://, and used as the main query. Note that the searches always exclude all matches from the site of the owning organisation, using the exclusion operator, -, and the web site match command, site:. Hence, for the University of Wolverhampton the query would be "wlv.ac.uk" -site:wlv.ac.uk. The searches were submitted automatically via the APIs of Bing and Yahoo, using Webometric Analyst. Search engine APIs do not return results identical to those of the web interface (McCown & Nelson, 2007), but they are broadly similar, and it is the APIs that are typically used for webometrics research and they are therefore the most appropriate choice of data source.

The organisation title searches were constructed as above, but with the organisation title in quotes replacing the URL or domain name (e.g., "University of Wolverhampton" -site:wlv.ac.uk). The LIS departments typically had names that included multiple parts that might not be mentioned together, so a full title search was not useful. For these, the constituent parts were separated and each part was placed in quotes. For instance, in the following example the university name and department name were separated: "Louisiana State University" "School of Library and Information Science" -site:lsu.edu site:edu. For the LIS departments, site:edu was added to the end of each query so that all results came (almost) exclusively from US educational institutions, to ensure that the total number of matches did not reach much above 1,000. This was a practical consideration to test the method effectively rather than a generic recommended procedure. Query splitting was also used for any query with over 850 results to look for additional matches. The query splitting technique produces additional matches above the normal maximum of 1,000 but causes many additional queries to be submitted and probably produces less complete lists of results than those for a query returning fewer than 1,000 matches. If the purpose of the paper had been to estimate the impact of the departments as accurately as possible, rather than to test the methods in a reasonable setting, then the site:edu part would not have been added and the limitations of less complete results and additional queries would have been accepted. For the UK universities, site:ac.uk was not added because the analysis for these is based upon HCEs, so the additional restriction was not necessary.
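The query construction just described can be sketched in a few lines of Python. This illustrates the query syntax only; the helper names are ours, and the resulting strings would still need to be submitted through a search engine API or Webometric Analyst.

```python
def url_citation_query(url, own_sites, restrict=None):
    """Build a URL citation query: the URL (without http://) as a phrase
    search, excluding matches from the organisation's own site(s)."""
    query = '"{}"'.format(url)
    query += ''.join(' -site:{}'.format(site) for site in own_sites)
    if restrict:  # e.g. "edu" to restrict results to US educational sites
        query += ' site:{}'.format(restrict)
    return query


def title_mention_query(name_parts, own_sites, restrict=None):
    """Build an organisation title mention query: each name part as a
    separate phrase search, with the same site exclusions."""
    query = ' '.join('"{}"'.format(part) for part in name_parts)
    query += ''.join(' -site:{}'.format(site) for site in own_sites)
    if restrict:
        query += ' site:{}'.format(restrict)
    return query


# The examples from the text:
print(url_citation_query('wlv.ac.uk', ['wlv.ac.uk']))
# "wlv.ac.uk" -site:wlv.ac.uk
print(title_mention_query(['Louisiana State University',
                           'School of Library and Information Science'],
                          ['lsu.edu'], restrict='edu'))
# "Louisiana State University" "School of Library and Information Science"
#   -site:lsu.edu site:edu
```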
The inlink searches were constructed as for the URL citation searches except that the linkdomain: command was used instead of quotes around the domain name for the universities (e.g., linkdomain:wlv.ac.uk -site:wlv.ac.uk) to get links to anywhere in the university web site. For the LIS departments, the link: command was used instead of the linkdomain: command to get all links to just the home page (e.g., link:http://www.slis.ua.edu -site:ua.edu site:edu). This ensured that the query did not get too many matches but meant that the results of the link searches for LIS departments were not fully compatible with the other two types of searches, which matched any reference to the department. This inlink restriction was made because some of the departments did not have their own domain name and so the linkdomain: command could not be used with them. This kind of limitation seems to be common in small webometric studies and so the option of removing departments without their own domain name was not considered.

A small methodological innovation has been introduced in this paper: the automatic inclusion of multiple queries for the same organisation within the data collection software (Webometric Analyst). Hence if an organisation has multiple names or web sites then these are automatically queried separately and the results (HCEs or URL lists) are automatically combined (added or merged, respectively). For instance, the Graduate School of Library and Information Science at Queens College, City University of New York has two alternative home page URLs: http://qcpages.qc.edu/GSLIS/ and http://qcpages.qc.cuny.edu/GSLIS/. In order to get comprehensive link counts, searches for both of them were run and the results combined. For URL lists, Webometric Analyst combines the searches and eliminates duplicate URLs. For searches with over 1,000 matches, the query splitting used (see above) gets additional matches beyond the normal search engine limit of 1,000, allowing the combining operation to proceed. Note that for HCEs the addition of the results of different searches is problematic because duplicate elimination is not possible; this is a limitation of the method. In Webometric Analyst, a pipe character | separates searches that are to be submitted separately but have results that will be combined. Hence the input to Webometric Analyst for this organisation was the following, all on a single line:

link:http://qcpages.qc.edu/GSLIS/ -site:qcpages.qc.edu/GSLIS/ -site:qcpages.qc.cuny.edu/GSLIS/ site:edu| link:http://qcpages.qc.cuny.edu/GSLIS/ -site:qcpages.qc.edu/GSLIS/ -site:qcpages.qc.cuny.edu/GSLIS/ site:edu

The same queries were used for Yahoo and Bing, via Webometric Analyst, except that the link searches were not submitted to Bing because it does not support link searching. For the US LIS searches, the exact numbers of matching URLs and domains were calculated with Webometric Analyst's Reports function. The number of domains is calculated by extracting the domain name of each matching URL and then counting the number of unique matching domain names. Domains were counted because domain or site counts are frequently used in webometrics research instead of URL counts. Spearman correlations were used to assess the degree of agreement between result sets.
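The combining and counting logic described above can be sketched as follows. This is a minimal illustration, not Webometric Analyst's actual implementation, and it assumes a hypothetical fetch_urls(query) function wrapping a search engine API that returns the list of matching URLs.

```python
from urllib.parse import urlparse


def combined_result_urls(query_line, fetch_urls):
    """Split a pipe-separated query line into its constituent searches,
    run each one via the supplied fetch_urls(query) -> list-of-URLs
    function (a hypothetical API wrapper), and merge the results,
    eliminating duplicate URLs."""
    merged = set()
    for query in query_line.split('|'):
        merged.update(fetch_urls(query.strip()))
    return sorted(merged)


def domain_count(urls):
    """Count unique matching domains by extracting the domain name of
    each matching URL (URLs are assumed to include the http:// scheme)
    and deduplicating, as in the Reports function described above."""
    return len({urlparse(url).netloc.lower() for url in urls})
```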
There is no gold standard to compare the results against, but the correlations can be used to assess the extent to which the different methods produce consistent results. Spearman correlations do not assess the volume of matches from each method, but this is ultimately less important than the correlation because the latter reflects the rank order of the organisations. For each data set an external judgement of one aspect of performance was chosen in lieu of a gold standard, as a check of external validity. For the US LIS departments, the 2009 US News rankings were chosen. These are based upon peer evaluations but are somewhat teaching-focussed since the publication is aimed at new students. For the UK, the Research Assessment Exercise (RAE) 2008 statistics were used to calculate a simple research productivity index. The RAE effectively gives each submitted academic from each university a score between 0 and 4 for their research quality. These scores were combined for each university with the formula n1 + 2n2 + 3n3 + 4n4, where ni is the number of academics with rating i. This is an oversimplification because a score of 4 could be worth far more than four scores of 1 (e.g., in terms of research funding), but it would be difficult to justify any other specific weights and so this simple approach has been adopted. The result for each university may be reasonably described as an indicator of research productivity.
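As a worked illustration of the productivity index and the rank-order comparison, the following sketch uses scipy's standard Spearman correlation; the numbers are invented purely for illustration.

```python
from scipy.stats import spearmanr


def rae_productivity(n1, n2, n3, n4):
    """Simple RAE 2008 research productivity index: n1 + 2n2 + 3n3 + 4n4,
    where ni is the number of submitted academics rated i."""
    return n1 + 2 * n2 + 3 * n3 + 4 * n4


# Hypothetical figures for four universities, for illustration only.
web_counts = [5400, 1200, 8700, 300]  # e.g. one web measure per university
rae_scores = [rae_productivity(10, 40, 55, 20),
              rae_productivity(30, 25, 10, 2),
              rae_productivity(5, 30, 80, 45),
              rae_productivity(12, 8, 3, 0)]
rho, p_value = spearmanr(web_counts, rae_scores)
print(rho, p_value)
```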
Results

The two case studies are reported separately, starting with the US LIS departments.
URL counts and domain counts: The case of US LIS departments

Tables 1 and 2 show the Spearman correlations for the web measures and the US News rankings of the departments. The five unranked departments (from Southern Connecticut State University, University of Denver, University of Puerto Rico, University of Southern Mississippi, and Valdosta State University) were given the joint lowest rank, as these were assumed not to be highly ranked departments. There is little difference overall between the URL counting measures and the domain counting measures, perhaps because both search engines tend to filter out multiple results from the same domain after the first two. The highest correlations are between the same measures from different search engines: URL cites and titles. The lower but positive correlations between different measures suggest that the three different measures are not interchangeable but have a large degree of commonality.
Table 1. Spearman correlations for the five URL counting web measures and US News rankings for the 49 US LIS departments*.

URL counting      Rank   Yahoo inlinks  Bing titles  Yahoo titles  Bing URL cites  Yahoo URL cites
US News Rank      1.00   -0.62          -0.72        -0.74         -0.64           -0.65
Yahoo inlinks            1.00           0.65         0.68          0.85            0.87
Bing titles                             1.00         0.96          0.67            0.77
Yahoo titles                                         1.00          0.70            0.78
Bing URL cites                                                     1.00            0.95
Yahoo URL cites                                                                    1.00

* All correlations are significant with p = 0.000.
Table 2. Spearman correlations for the five domain counting web measures and US News rankings for the 49 US LIS departments*.

Domain counting   Rank   Yahoo inlinks  Bing titles  Yahoo titles  Bing URL cites  Yahoo URL cites
US News Rank      1.00   -0.64          -0.70        -0.73         -0.64           -0.67
Yahoo inlinks            1.00           0.67         0.69          0.85            0.89
Bing titles                             1.00         0.97          0.72            0.77
Yahoo titles                                         1.00          0.73            0.78
Bing URL cites                                                     1.00            0.98
Yahoo URL cites                                                                    1.00

* All correlations are significant with p = 0.000.
HCEs: The case of UK universities

Table 3 reports the correlations between the different HCEs for UK universities. These correlations are sometimes much lower than the comparable correlations for US LIS departments. The Bing title correlations are all low, suggesting that Bing should not be used for title search HCEs for large organisations or for large web sites. The remaining four measures all correlate highly with each other and with the RAE 2008 rankings, giving some confidence that they are reasonably interchangeable. There are some significant outliers, however. For example, there is a particularly high value of 66,633,700 for the total of the searches "ex.ac.uk" -site:ex.ac.uk -site:exeter.ac.uk site:ac.uk and "exeter.ac.uk" -site:ex.ac.uk -site:exeter.ac.uk site:ac.uk. The search was run repeatedly, giving varied but similarly high results, so it is not a one-off estimation anomaly but is related to the search itself. Some of the results for the first search did not match ex.ac.uk but matched sussex.ac.uk instead, with the document parser splitting the word sussex before ex. This may account for the remaining anomalously high values, especially because the domain sussex.ac.uk is not excluded from the searches. Adding -site:sussex.ac.uk to the searches gives 28,300,000 + 850,000 = 29,150,000 matches, which is still anomalously high, however.
Table 3. Spearman correlations for the five HCE web measures and RAE rankings for the 131 UK universities.

HCE               RAE 2008  Yahoo inlinks  Bing titles  Yahoo titles  Bing URL cites  Yahoo URL cites
RAE 2008          1.00      0.86*          0.13         0.88*         0.78*           0.88*
Yahoo inlinks               1.00           0.06         0.91*         0.86*           0.94*
Bing titles                                1.00         0.16          0.08            0.03
Yahoo titles                                            1.00          0.84*           0.94*
Bing URL cites                                                        1.00            0.90*
Yahoo URL cites                                                                       1.00

* Significant at p = 0.001 (other correlations not significant at p = 0.05).
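One way to screen for retrieval anomalies of the kind found here (sussex.ac.uk pages matching a query for ex.ac.uk) is to post-filter the text snippets returned with each result, requiring the queried domain to appear as a whole token. A minimal sketch; the function name and the regular-expression heuristic are ours, not part of any existing tool.

```python
import re


def flag_anomalous_snippets(query_domain, snippets):
    """Flag result snippets in which the queried domain appears only as
    the tail of a longer token (e.g. "sussex.ac.uk" matching a query for
    "ex.ac.uk"), signalling a search engine retrieval anomaly."""
    genuine = re.compile(r'(?<![A-Za-z0-9])' + re.escape(query_domain))
    return [s for s in snippets if not genuine.search(s)]


print(flag_anomalous_snippets('ex.ac.uk',
      ['Contact us at www.ex.ac.uk for details.',
       'See www.sussex.ac.uk for the full programme.']))
# ['See www.sussex.ac.uk for the full programme.']
```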
Discussion and limitations

An important limitation of this research is that only one data set was used for each case and it is possible that different conclusions would be drawn from different data sets, particularly if they were associated with unusual use of links, URL citations or organisation titles. Hence it is recommended that future webometric impact research should routinely compare the results of different methods and reject any method that gives inconsistent results for any reason. A second limitation is that the results were only analysed quantitatively and therefore no evidence was provided that the three methods (and the organisation title searches in particular) give results that are meaningful in the sense that the source pages correctly mention the target organisation in a positive context that makes the mentions worth counting. The significant correlations found with offline measures (US News and RAE 2008) provide some evidence to support the value of the results, however.

In terms of the accuracy of the results of the searches, there seem to be four different types of problem, which are named and described in Table 4, together with advice for dealing with them. Note that the names differ from some used previously, such as technical relevance (Bar-Ilan & Peritz, 2009), because of a focus on a different task. For instance, a URL is technically relevant to a query if it matches the query, whether or not the search engine returns it. This relates to the search engine coverage issue in Table 4 because not all technically relevant URLs are returned for the typical query. The issues in Table 4 all represent unavoidable limitations in webometrics research but in most cases steps can be taken to minimise their impact.
Conclusions

The discussion above confirms that webometric studies have a number of unavoidable limitations in the accuracy of their results. If highly accurate results are needed then these are unlikely to be obtained from webometric methods. This limitation does not prevent webometric techniques from delivering results that are useful in some contexts, however. If indicators are acceptable, such as metrics that are known normally to correlate highly with relevant research measures, then the results suggest that all three metrics discussed here (inlinks, URL citations and organisation title mentions) can be suitable for research organisations. This was previously known for inlinks but not for the other two metrics. For exact counting, the correlations between the three metrics for URL and domain counts (Tables 1 and 2) are all high enough to support a case that they could be used interchangeably for impact measurements, although there will be differences in their results. Moreover, the results of Bing and Yahoo have high correlations, suggesting that these are almost equivalent in practice. Although inlinks have been used primarily in webometric impact research, this suggests that URL citations and organisation title mentions (e.g., title searches) are appropriate replacements if link searches are withdrawn by Yahoo.
For HCEs, Bing is not recommended for title searches for large organisations or web sites because it gave inconsistent results that disagree with all Yahoo results and with URL citations in Bing. Bing also gave search engine retrieval anomalies for URL citation searches, although the results overall were consistent in the sense of correlating highly with results from Yahoo. One drawback with organisation title mentions is that the exact search to be used is ambiguous because organisations can be referred to with different equivalent names, and sub-organisations may be referred to in even more varied ways in conjunction with the parent organisation (the query polysemy issue discussed above). Hence manual labour is needed to identify not only common name variations but also effective combinations of searches to capture the most common ways in which an organisation is described online. At the level of analysing results, extra work is not needed, however, since Webometric Analyst has a facility to automatically combine the results of multiple searches and also to eliminate duplicates if URL or domain counting is used. This makes organisation title mentions a practical impact metric for webometrics.
Table 4. Sources of inaccuracy in search engine results

Search engine coverage
Issue: Search engines cover only a fraction of the web and this fraction varies between search engines.
Example: The absolute number of results varies substantially between the search engines used. All results reported in this paper are probably underestimates of the number of matching web pages.
Practical implications: The results should never be claimed to represent the "whole web". The goal of webometric studies should be to get the relative numbers of results between queries consistent, rather than to get all results.

Search engine retrieval variation
Issue: Due to spam filtering (and estimation algorithms for HCEs) there will be variations in the extent to which the results retrieved from a search engine match the results known to it.
Example: The variation from a straight line in the data may be partly due to retrieval variation.
Practical implications: Use triangulation with other methods or search engines to decide if the variations are too large to use the data (e.g., Bing title HCEs). If exact ranking is important, consider combining multiple webometric sources in an appropriate linear combination to smooth out variations.

Search engine retrieval anomalies
Issue: Due to text processing algorithm configurations, some queries may return many incorrect matches.
Example: sussex.ac.uk being mistaken for "suss ex.ac.uk" gave an anomaly in the ex.ac.uk results.
Practical implications: When possible, multiple methods or search engines should be triangulated together to help identify anomalies and select an anomaly-free search engine and method.

Query polysemy
Issue: A query may match irrelevant documents.
Example: The query "Oxford University" matches Oxford University Press. "UCL" matches University College London and Université Catholique de Louvain and is a commonly-used abbreviation for both.
Practical implications: The researcher should try to construct one or more queries that match most relevant documents and match few irrelevant documents. If this is impossible, manual filtering of results may be needed, or queries with low coverage used, in conjunction with a disclaimer.
References

Aguillo, I. F. (2009). Measuring the institution's footprint in the web. Library Hi Tech, 27(4), 540-556.
Aguillo, I. F., Granadino, B., Ortega, J. L., & Prieto, J. A. (2006). Scientific research activity and communication measured with cybermetrics indicators. Journal of the American Society for Information Science and Technology, 57(10), 1296-1302.
An, L., & Qiu, J. P. (2004). Research on the relationships between Chinese journal impact factors and external web link counts and web impact factors. Journal of Academic Librarianship, 30(3), 199-204.
Bar-Ilan, J. (1999). Search engine results over time - a case study on search engine stability. Cybermetrics. Retrieved January 26, 2006, from http://www.cindoc.csic.es/cybermetrics/articles/v2i1p1.html
Bar-Ilan, J. (2001). Data collection methods on the Web for informetric purposes - A review and analysis. Scientometrics, 50(1), 7-32.
Bar-Ilan, J. (2005). What do we know about links and linking? A framework for studying links in academic environments. Information Processing & Management, 41(4), 973-986.
Bar-Ilan, J., & Peritz, B. C. (2008). The lifespan of 'informetrics' on the Web: An eight year study (1998-2006). Scientometrics, 79(1), 7-25.
Bar-Ilan, J., & Peritz, B. C. (2009). A method for measuring the evolution of a topic on the Web: The case of "Informetrics". Journal of the American Society for Information Science and Technology, 60(9), 1730-1740.
Barjak, F., Li, X., & Thelwall, M. (2007). Which factors explain the web impact of scientists' personal home pages? Journal of the American Society for Information Science and Technology, 58(2), 200-211.
Barjak, F., & Thelwall, M. (2008). A statistical analysis of the web presences of European life sciences research teams. Journal of the American Society for Information Science and Technology, 59(4), 628-643.
BBC. (2009). Profiles: Yahoo, Google and Microsoft. BBC News. Retrieved February 28, 2011 from: http://news.bbc.co.uk/2011/hi/business/7222431.stm
Björneborn, L., & Ingwersen, P. (2004). Toward a basic framework for webometrics. Journal of the American Society for Information Science and Technology, 55(14), 1216-1227.
Bornmann, L., & Daniel, H.-D. (2006). Selecting scientific excellence through committee peer review - A citation analysis of publications previously published to approval or rejection of post-doctoral research fellowship applicants. Scientometrics, 68(3), 427-440.
Bornmann, L., & Daniel, H.-D. (2008). What do citation counts measure? A review of studies on citing behavior. Journal of Documentation, 64(1), 45-80.
Brin, S., & Page, L. (1998). The anatomy of a large scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7), 107-117.
Chen, C., Newman, J., Newman, R., & Rada, R. (1998). How did university departments interweave the Web: A study of connectivity and underlying factors. Interacting With Computers, 10(4), 353-373.
Cothey, V. (2004). Web-crawling reliability. Journal of the American Society for Information Science and Technology, 55(14), 1228-1238.
Cronin, B. (2001). Bibliometrics and beyond: some thoughts on web-based citation analysis. Journal of Information Science, 27(1), 1-7.
Cronin, B., & Shaw, D. (2002). Banking (on) different forms of symbolic capital. Journal of the American Society for Information Science, 53(13), 1267-1270.
Cronin, B., Snyder, H. W., Rosenbaum, H., Martinson, A., & Callahan, E. (1998). Invoked on the web. Journal of the American Society for Information Science, 49(14), 1319-1328.
Garfield, E. (2005). The agony and the ecstasy: The history and the meaning of the journal impact factor. Fifth International Congress on Peer Review in Biomedical Publication, Chicago, USA. Retrieved September 27, 2007: http://garfield.library.upenn.edu/papers/jifchicago2005.pdf
Gingras, Y., & Wallace, M. L. (2010). Why it has become more difficult to predict Nobel Prize winners: A bibliometric analysis of nominees and winners of the chemistry and physics prizes (1901-2007). Scientometrics, 82(2), 401-412.
Gomes, B., & Smith, B. T. (2003). Detecting query-specific duplicate documents. United States Patent 6,615,209. Available: http://www.patents.com/Detecting-query-specific-duplicate-documents/US6615209/en-US/
Google (2011). JSON/Atom Custom Search API. Retrieved April 8, 2011 from: http://code.google.com/apis/customsearch/v1/overview.html
Ingwersen, P. (1998). The calculation of Web Impact Factors. Journal of Documentation, 54(2), 236-243.
Kousha, K., & Thelwall, M. (2006). Motivations for URL citations to open access library and information science articles. Scientometrics, 68(3), 501-517.
Kousha, K., & Thelwall, M. (2007a). Google Scholar citations and Google Web/URL citations: A multi-discipline exploratory analysis. Journal of the American Society for Information Science and Technology, 58(7), 1055-1065.
Kousha, K., & Thelwall, M. (2007b). How is science cited on the web? A classification of Google unique web citations. Journal of the American Society for Information Science and Technology, 58(11), 1631-1644.
Kousha, K., & Thelwall, M. (2008). Assessing the impact of disciplinary research on teaching: An automatic analysis of online syllabuses. Journal of the American Society for Information Science and Technology, 59(13), 2060-2069.
Kousha, K., & Thelwall, M. (2009). Google Book Search: Citation analysis for social science and the humanities. Journal of the American Society for Information Science and Technology, 60(8), 1537-1549.
Kousha, K., Thelwall, M., & Rezaie, S. (2010). Using the Web for research evaluation: The Integrated Online Impact indicator. Journal of Informetrics, 4(1), 124-135.
Lawrence, S., & Giles, C. L. (1999). Accessibility of information on the web. Nature, 400(6740), 107-109.
Lewandowski, D., Wahlig, H., & Meyer-Bautor, G. (2006). The freshness of web search engine databases. Journal of Information Science, 32(2), 131-148.
Li, X., Thelwall, M., Wilkinson, D., & Musgrove, P. B. (2005a). National and international university departmental web site interlinking, part 1: Validation of departmental link analysis. Scientometrics, 64(2), 151-185.
Li, X., Thelwall, M., Wilkinson, D., & Musgrove, P. B. (2005b). National and international university departmental web site interlinking, part 2: Link patterns. Scientometrics, 64(2), 187-208.
MacRoberts, M. H., & MacRoberts, B. R. (1996). Problems of citation analysis. Scientometrics, 36(3), 435-444.
McCown, F., & Nelson, M. L. (2007). Agreeing to disagree: Search engines and their public interfaces. Joint Conference on Digital Libraries. Retrieved June 18, 2007 from: http://www.cs.odu.edu/~fmccown/pubs/se-apis-jcdl2007.pdf
Meho, L. I., & Sonnenwald, D. H. (2000). Citation ranking versus peer evaluation of senior faculty research performance: A case study of Kurdish scholarship. Journal of the American Society for Information Science, 51(2), 123-138.
Merton, R. K. (1973). The sociology of science: Theoretical and empirical investigations. Chicago: University of Chicago Press.
Mettrop, W., & Nieuwenhuysen, P. (2001). Internet search engines - fluctuations in document accessibility. Journal of Documentation, 57(5), 623-651.
Moed, H. F. (2005). Citation analysis in research evaluation. New York: Springer.
Oppenheim, C. (1995). The correlation between citation counts and the 1992 Research Assessment Exercise ratings for British library and information science university departments. Journal of Documentation, 51, 18-27.
Oppenheim, C. (1997). The correlation between citation counts and the 1992 Research Assessment Exercise ratings for British research in Genetics, Anatomy and Archaeology. Journal of Documentation, 53, 477-487.
Qiu, J. P., Chen, J. Q., & Wang, Z. (2004). An analysis of backlink counts and Web Impact Factors for Chinese university websites. Scientometrics, 60(3), 463-473.
Rousseau, R. (1999). Daily time series of common single word searches in AltaVista and NorthernLight. Cybermetrics, 2/3. Retrieved July 25, 2006 from: http://www.cindoc.csic.es/cybermetrics/articles/v2002i2001p2002.html
Seglen, P. O. (1998). Citation rates and journal impact factors are not suitable for evaluation of research. ACTA Orthopaedica Scandinavica, 69(3), 224-229.
Seidman, E. (2007). We are flattered, but... Bing Community. Retrieved November 11, 2009 from: http://www.bing.com/community/blogs/search/archive/2007/2003/2028/we-are-flattered-but.aspx
Smith, A., & Eysenck, M. (2002). The correlation between RAE ratings and citation counts in psychology. Retrieved April 14, 2005 from: http://psyserver.pc.rhbnc.ac.uk/citations.pdf
Smith, A. G. (1999). A tale of two web spaces: comparing sites using Web Impact Factors. Journal of Documentation, 55(5), 577-592.
Stuart, D., & Thelwall, M. (2006). Investigating triple helix relationships using URL citations: A case study of the UK West Midlands automobile industry. Research Evaluation, 15(2), 97-106.
Thelwall, M. (2002a). Conceptualizing documentation on the Web: An evaluation of different heuristic-based models for counting links between university web sites. Journal of the American Society for Information Science and Technology, 53(12), 995-1005.
Thelwall, M. (2002b). The top 100 linked pages on UK university Web sites: high inlink counts are not usually directly associated with quality scholarly content. Journal of Information Science, 28(6), 485-493.
Thelwall, M. (2004). Methods for reporting on the targets of links from national systems of university Web sites. Information Processing and Management, 40(1), 125-144.
Thelwall, M. (2008a). Extracting accurate and complete results from search engines: Case study Windows Live. Journal of the American Society for Information Science and Technology, 59(1), 38-50.
Thelwall, M. (2008b). Quantitative comparisons of search engine results. Journal of the American Society for Information Science and Technology, 59(11), 1702-1710.
Thelwall, M. (2011, in press). A comparison of link and URL citation counting. ASLIB Proceedings. Retrieved February 28 from: http://cybermetrics.wlv.ac.uk/paperdata/LinkURLcitationPreprint.doc
Thelwall, M., & Harries, G. (2004). Do the Web sites of higher rated scholars have significantly more online impact? Journal of the American Society for Information Science and Technology, 55(2), 149-159.
Thelwall, M., & Kousha, K. (2008). Online presentations as a source of scientific impact? An analysis of PowerPoint files citing academic journals. Journal of the American Society for Information Science and Technology, 59(5), 805-815.
Thelwall, M., & Zuccala, A. (2008). A university-centred European Union link analysis. Scientometrics, 75(3), 407-420.
Uyar, A. (2009a). Google stemming mechanisms. Journal of Information Science, 35(5), 499-514.
Uyar, A. (2009b). Investigation of the accuracy of search engine hit counts. Journal of Information Science, 35(4), 469-480.
van Raan, A. F. J. (2000). The Pandora's box of citation analysis: Measuring scientific excellence - The last evil? In B. Cronin & H. B. Atkins (Eds.), The Web of Knowledge: A Festschrift in Honor of Eugene Garfield (pp. 301-319). Medford, NJ: Information Today, Inc. ASIS Monograph Series.
Vaughan, L. (2005). Exploring website features for business information. Scientometrics, 61(3), 467-477.
Vaughan, L., & Hysen, K. (2002). Relationship between links to journal Web sites and impact factors. ASLIB Proceedings, 54(6), 356-361.
Vaughan, L., & Shaw, D. (2003). Bibliographic and Web citations: What is the difference? Journal of the American Society for Information Science and Technology, 54(14), 1313-1322.
Vaughan, L., & Shaw, D. (2005). Web citation data for impact assessment: A comparison of four science disciplines. Journal of the American Society for Information Science and Technology, 56(10), 1075-1087.
Vaughan, L., & Wu, G. (2004). Links to commercial websites as a source of business information. Scientometrics, 60(3), 487-496.
Vaughan, L., & You, J. (2010). Word co-occurrences on Webpages as a measure of the relatedness of organizations: A new Webometrics concept. Journal of Informetrics, 4(4), 483-491.
Wilkinson, D., Harries, G., Thelwall, M., & Price, E. (2003). Motivations for academic Web site interlinking: Evidence for the Web as a novel source of information on informal scholarly communication. Journal of Information Science, 29(1), 49-56.
Yahoo (2011a). When will the change happen? How long will the transition take? Retrieved April 9, 2011 from: http://help.yahoo.com/l/us/yahoo/search/alliance/alliance-2.html
Yahoo (2011b). Web Search APIs from Yahoo! Search. Retrieved April 13, 2011 from: http://developer.yahoo.com/search/web/webSearch.html
Zuccala, A., Thelwall, M., Oppenheim, C., & Dhiensa, R. (2007). Web intelligence analyses of digital libraries: A case study of the National electronic Library for Health (NeLH). Journal of Documentation, 64(3), 558-589.