
Comparative Recall and Precision of Simple and Expert Searches in Google Scholar and Eight Other Databases
William H. Walters

This document is the final, published version of an article in portal: Libraries and the Academy, vol. 11, no. 4 (October 2011), pp. 971–1006.

It is also available from the publisher’s web site at http://muse.jhu.edu/login?auth=0&type=summary&url=/journals/portal_libraries_and_the_academy/v011/11.4.walters.html and at http://www.press.jhu.edu/journals/portal_libraries_and_the_academy/portal_pre_print/archive/articles/11.4walters.pdf

Comparative Recall and Precision of Simple and Expert Searches in Google Scholar and Eight Other Databases

William H. Walters

abstract: This study evaluates the effectiveness of simple and expert searches in Google Scholar (GS), EconLit, GEOBASE, PAIS, POPLINE, PubMed, Social Sciences Citation Index, Social Sciences Full Text, and Sociological Abstracts. It assesses the recall and precision of 32 searches in the field of later-life migration: nine simple keyword searches and 23 expert searches constructed by demography librarians at three top universities. For simple searches, Google Scholar’s recall and precision are well above average. For expert searches, the relative effectiveness of GS depends on the number of results users are willing to examine. Although Google Scholar’s expert-search performance is just average within the first fifty search results, GS is one of the few databases that retrieves relevant results with reasonably high precision after the fiftieth hit. The results also show that simple searches in GS, GEOBASE, PubMed, and Sociological Abstracts have consistently higher recall and precision than expert searches. This can be attributed not to differences in expert-search effectiveness, but to the unusually strong performance of simple searches in those four databases.

Since its introduction in November 2004, Google Scholar (GS) has risen to prominence as a major bibliographic database. In a recent survey of more than 3,000 faculty, Google and Google Scholar were together identified as the third most common mechanism for finding information in academic journals, after “searching electronic databases” and “following citations from other journal articles.”1 Nonetheless, many information professionals have been reluctant to provide systematic access to Google Scholar. In 2005, just 24 percent of North American research libraries included GS in their online database lists, and fewer than twenty percent listed it as a recommended internet search engine. Two years later, just 32 percent of OhioLINK libraries mentioned GS on their web sites.2 Although most doctoral research universities now provide access to GS through their web sites and link resolvers, most bachelor’s- and master’s-level institutions do not.3

Librarians’ lack of enthusiasm for Google Scholar may stem, at least partly, from the unconventional methods used to build the database. While most bibliographic databases index particular journals in their entirety, Google Scholar’s coverage is essentially document- and publisher-based. Specifically, GS gets its bibliographic records from three sources: (1) freely available web documents that “look scholarly” in their content or format; (2) articles or documents supplied by Google Scholar’s partner agencies: journal publishers, scholarly societies, database vendors, and academic institutions; (3) citations extracted from the reference lists of previously indexed documents.4

Only the records supplied by Google Scholar’s partner agencies are likely to provide consistent coverage of particular journals, and even that journal coverage is not truly comprehensive. As several authors have noted, GS does not index every article available through partner agencies’ web sites.5 Librarians’ reluctance to guide patrons to Google Scholar may also result from a misunderstanding about the database content. Some may regard GS as a subset of Google, when in fact only records of type (1) can be found through the regular Google interface.

For many purposes, the methods used to build the database are less important than the bottom line: Are Google Scholar searches effective in identifying relevant documents? Previous research has shown that for simple keyword searches, GS performs well in comparison with conventional bibliographic databases. Within the field of later-life migration, for example, GS indexes 93 percent of the relevant literature and achieves high recall and precision when a simple keyword search phrase is used.6 Because the GS interface is relatively unsophisticated, however, we might expect that other databases will perform better than GS for more complex searches that draw on expert knowledge and take full advantage of the search features available within each database. This study investigates that possibility, examining Google Scholar’s effectiveness as a tool for serious research. Specifically, it evaluates the performance of GS and eight other databases within the field of later-life migration, an interdisciplinary research area that encompasses elderly migration, retirement migration, post-retirement migration, and related types of geographic mobility.
The primary objective of the study is to determine whether GS maintains high recall and precision when expert, rather than simple, searches are conducted. A secondary objective is to identify the databases for which simple or expert searching is especially effective—to explore why expert searching increases the effectiveness of some databases but not others. The paper presents three sets of comparisons:

(1) simple searches in GS versus those in the eight other databases;
(2) expert searches in GS versus those in the eight other databases;
(3) simple versus expert searches within each of the nine databases.

The first comparison confirms the results of an earlier study.7 The second comparison is based on expert searches constructed by the demography librarians at three major research universities. The third comparison looks at both sets of search results, comparing simple and expert searches within each database.

Context and Previous Research

Early Reviews of Google Scholar

Early reviews of GS were generally negative, emphasizing its lack of controlled vocabulary and subject headings, its inconsistency in reporting author names and journal titles, its idiosyncratic handling of Boolean operators, and the absence of mechanisms for sorting, marking, manipulating, and exporting search results.8 Several early studies also reported that GS retrieved a relatively high proportion of non-scholarly documents. Joann M. Wleklinski, searching for information on the political scientist Ithiel de Sola Pool, found relatively many citations to sources that were neither peer-reviewed nor authoritative. Susan Gardner and Susanna Eng reached a similar conclusion when searching for papers on homeschooling in GS and three other databases: “There is more variety in Google Scholar and a higher number of results, but they are not necessarily as scholarly or relevant.” Likewise, Burton Callicott and Debbie Vaughn reported that Google Scholar’s retrieval rate, based on the first 100 hits, was somewhat lower than that of EBSCO Academic Search Premier.9

Although serious problems persist—in particular, idiosyncratic search behavior and incomplete or inaccurate bibliographic records10—recent investigations have reported favorably on Google Scholar’s coverage of the scholarly literature. This can perhaps be attributed to the more systematic nature of recent studies and to improvements in GS that have been made over the past few years. Recent evidence also suggests that undergraduates tend to prefer GS over conventional bibliographic databases due to its simplicity, its speed, and its similarity to Google.11

Coverage (Content) of the Google Scholar Database

Investigations of Google Scholar’s coverage generally use author or title searches to determine whether particular articles are indexed by GS. These studies are concerned not with the effectiveness of the search mechanism, but with the content of the database itself. In the earliest large-scale analysis of this type, Chris Neuhaus and associates generated a random sample of 2,350 journal articles from 47 bibliographic databases, then calculated the proportion of the articles for which records could be found in Google Scholar. GS included citations for sixty percent of the articles, although the results varied dramatically by subject area. On average, GS provided 76 percent coverage in the natural sciences but only 41 percent coverage in education, 39 percent coverage in the social sciences, and ten percent coverage in the humanities.12 In a similar investigation, Marilyn Christianson searched for articles published in the top journals, reporting that GS included full or partial citations for 89 percent of the 840 sampled articles. Likewise, GS indexed 66 percent of 960 engineering articles selected from Compendex.13 Finally, Philipp Mayr and Anne-Kathrin Walter searched GS for journal titles, counting each journal as indexed if GS included records for one or more articles from the journal. They found that GS provided thorough coverage of the journals covered by Social Sciences Citation Index (88 percent), Science Citation Index (86 percent), and Arts & Humanities Citation Index (81 percent) but less complete coverage of Open Access journals (68 percent) and German-language sociology journals (70 percent).14

Other investigations have adopted a comparative approach, evaluating Google Scholar’s coverage (content) relative to that of other bibliographic databases. Six such studies have been published:

(1) Walters conducted searches for 155 core articles in the field of later-life migration, reporting that GS indexed 93 percent of that literature—far more than AgeLine, ArticleFirst, EBSCO Academic Search Elite, GEOBASE, POPLINE, Social Sciences Abstracts, or Social Sciences Citation Index (SSCI). Moreover, GS provided more uniform publisher and date coverage than any other database.15

(2) Lyle Ford and Lisa Hanson O’Hara searched for 39 articles in the natural sciences, reporting that GS indexed 72 percent of them; a second search service indexed 49 percent, and Windows Live Academic Search indexed just 41 percent.16

(3) Michael Norris, Charles Oppenheim, and Fytton Rowland searched for nearly 1,000 articles from economics, ecology, and sociology journals that were indexed by SSCI or Science Citation Index (SCI) and known to be freely accessible online. GS indexed more than 68 percent of the articles, far more than OpenDOAR (eleven percent) and OAIster (two percent).17

(4) Jared L. Howland and associates searched for 133 articles in the field of continuing education. GS indexed 53 percent of the articles. ERIC indexed 24 percent, and no other database (Academic OneFile, EBSCO Academic Search Premier, EBSCO Professional Development Collection, Education Full Text, or ProQuest Research Library) indexed more than twelve percent.18

(5) Hannah Rozear undertook author searches for 470 articles by twelve art historians, reporting that GS indexed only 33 percent of the relevant articles—significantly fewer than Arts & Humanities Citation Index (73 percent), Bibliography of the History of Art (56 percent), and Art Full Text/Art Index Retrospective (41 percent).19

(6) Susanne Mikki searched GS and SCI/SSCI for papers by 29 authors in climatology, petroleum geology, and related fields. GS retrieved citations for more than 85 percent of the articles found in SCI/SSCI. In contrast, SCI/SSCI retrieved citations for fewer than a third of the articles found in GS.20

Together, these studies demonstrate that GS provides good coverage of the natural and social sciences but less comprehensive coverage of fields such as art history. Of course this may reflect the relatively low number of papers in the humanities that are available online.21

Two studies demonstrate that the documents indexed by GS are no lower in quality than those found through conventional bibliographic databases. Rena Helms-Park, Pavlina Radia, and Paul Stapleton examined the information resources selected by undergraduates who used GS in a first-year English course. Assessing each document in terms of authority, objectivity, rigor, and transparency, they reported no difference in quality between the documents found through GS and those found through conventional bibliographic databases.22 In a similar study, Jared Howland and associates asked librarians at Brigham Young University to assign scholarliness scores to the documents retrieved by seven GS searches and seven comparable searches in conventional library databases. Each score was based on the accuracy, authority, coverage, currency, objectivity, and relevance of the document. Articles found in both databases had the highest average score: 14.2 on a scale of 6 to 18. Those found only in GS had a comparable average score (14.0), while those found only in conventional library databases were significantly “less scholarly,” with an average score of just 11.9.23

Effectiveness of Subject Searches in Google Scholar

A second line of investigation examines Google Scholar’s search effectiveness—the number of relevant documents actually retrieved by subject or keyword searches rather than the number included, and potentially retrievable, within the database. Five papers have systematically evaluated the effectiveness of subject searches in GS:

(1) Searching for information on Nodilittorina, a type of periwinkle, D. Yvonne Jones found that GS retrieved a greater number of relevant records than each of nine other databases. Only BIOSIS had a higher retrieval rate, and Google Scholar’s coverage of the most recent literature was especially impressive.24

(2) Mary L. Robinson and Judith Wusteman ran searches on four topics in the life and environmental sciences. Examining the first ten results of each search, they found that GS had higher recall and higher precision than Yahoo!, Google, and Ask.com.25

(3) Glenn Haya, Else Nygren, and Wilhelm Widmark compared GS with Metalib, a federated search tool representing more than 200 databases available at Stockholm University. Thirty-two undergraduates, chiefly in the social sciences, conducted searches for their thesis research. The resulting citations were counted as relevant if the students saved them for future use during their 20-minute sessions with GS and Metalib. On average, the students found more than twice as many relevant documents in GS as in Metalib. Students without prior training found relatively more peer-reviewed articles in Metalib, although students who had been trained in the use of both search tools found more in GS.26

(4) Michael Levine-Clark and Joseph Kraus conducted two subject searches, two searches for chemical compounds, and one name-as-subject search in GS and Chemical Abstracts. They found 1.4 times as many citations in GS but made no attempt to assess the relevance of the documents retrieved.27

(5) Focusing on the literature of later-life migration, Walters evaluated the results of simple keyword searches in 12 databases. Although the performance of each database varied with the number of search results examined, GS consistently ranked among the top four databases in terms of both recall and precision.28

Based on these findings, we can conclude that GS is relatively effective in terms of recall and precision within the natural and social sciences. Only BIOSIS, MEDLINE, and SSCI show signs of matching or exceeding Google Scholar’s overall performance.

As noted earlier, all previous studies of Google Scholar’s search effectiveness have been based on relatively simple searches. Jones used a one-word search term (Nodilittorina), and Haya and associates relied on the searches conducted by undergraduates. Levine-Clark and Walters used simple search phrases such as hypericin AND epr and elderly migration. Only the searches composed by Robinson and Wusteman used Boolean logic to a significant extent: e.g., river AND (restoration OR rehabilitation OR sediment OR erosion). Previous studies have not drawn on expert knowledge or made full use of the search features available within each database.

Methods

Identification of Relevant Records

This study is based on searches for the core literature of later-life migration, a field which includes topics such as the spatial patterns of migration, the impact of personal attributes on mobility, the impact of place characteristics on migration, and the consequences of migration for individuals and communities. Later-life migration was chosen due to its multidisciplinary scope, its inclusion in several major social science databases, and its appropriateness as a topic for both simple and expert searches—both student term papers and faculty research.

The records retrieved by each search qualified as relevant only if they could be found within a set of 155 documents identified in advance: the most important journal articles on later-life migration published from January 1990 to December 2000. This approach, based on the idea that relevance comprises more than just appropriateness of subject, acknowledges that some papers are far more important than others29 and that the most relevant documents are those with the greatest potential impact—those that present novel, meaningful, and well-supported findings of relevance to scholars and practitioners.

Specifically, the relevant records consist of the 155 journal articles described in “Later-Life Migration in the United States,” a bibliographic essay that has been cited more than fifty times since its publication in the Journal of Planning Literature.30 The 155 articles were selected through a process that required (1) the use of comprehensive database searching, citation tracing, journal browsing, and consultation with colleagues to identify every paper on later-life migration published from 1990 to 2000; (2) the careful reading of more than 500 potentially relevant articles; (3) the evaluation of those articles on the basis of subject relevance, importance of findings, innovativeness of methods or approach, number of other studies on the topic, accessibility of content (readability), and accessibility of the document itself (availability to students and scholars). The evaluation of potentially relevant articles was conducted without knowledge of the database(s) in which each study was found. Nonetheless, the evaluations reflect the judgments of a single reviewer and were not checked by others for reliability.

Simple Searches

Table 1
Databases Evaluated

Database                                   Platform
Google Scholar (GS)                        http://scholar.google.com/advanced_scholar_search
EconLit                                    EBSCOhost
GEOBASE                                    OCLC FirstSearch
PAIS International (PAIS)                  CSA Illumina
POPLINE                                    http://www.popline.org/expert.html
PubMed                                     http://www.ncbi.nlm.nih.gov/pubmed/advanced
Social Sciences Citation Index (SSCI)      Web of Science
Social Sciences Full Text (SSFT)           WilsonWeb
Sociological Abstracts (SocAbs)            CSA Illumina

Note. The CSA Illumina interface, rather than EBSCOhost, was used for one of the three expert searches in EconLit. The results for POPLINE, PubMed, and SSFT were sorted by date (most recent first) rather than by relevance.

Nine simple keyword searches were conducted, one in each of the databases shown in Table 1.31 The databases were selected for their coverage of disciplines that are well-represented within the literature of later-life migration.32 While six of them are conventional subscription databases, three (GS, POPLINE, and PubMed) are freely available online. In every database but PubMed, the search was a keyword phrase, elderly migration, entered into the most basic search interface that permitted the appropriate date restriction. A phrase search (“elderly migration”) was used in PubMed, since the standard keyword search returned a large number of extraneous hits. Table 2 shows the simple searches performed in GS, PubMed, SSCI, and Sociological Abstracts.

Table 2
Examples of Simple Searches

Google Scholar
Interface: http://scholar.google.com/advanced_scholar_search
With all of the words: elderly migration
Where my words occur: anywhere in the article
Search articles in all subject areas; include patents
Limited by date, 1990 to 2000
About 30,200 hits

PubMed
Interface: http://www.ncbi.nlm.nih.gov/pubmed
Search: “elderly migration”
Limited by date, 1990 to 2000
18 hits

Social Sciences Citation Index
Interface: Web of Science Search
Search: Topic=(elderly migration)
Limited by date, 1990 to 2000
117 hits

Sociological Abstracts
Interface: CSA Illumina Advanced Search
Search: KW=(elderly) and KW=(migration) (equivalent to a simple search for elderly migration)
Limited by date, 1990 to 2000
111 hits

The results of each search were sorted by relevance, if possible. Relevance sorting was not available in POPLINE or PubMed, so those results were sorted by publication date, instead. The results for Social Sciences Full Text were also sorted by date, since relevance was not an effective sort field in that database; every hit had a relevance score of 100 percent.

Expert Searches

Twenty-three expert searches were constructed by demography librarians at the University of Michigan, Penn State University, and Princeton University. Each librarian was asked to search every database shown in Table 1. Not every database was available at every institution, however, so the analysis of expert searches is based on one search in GEOBASE, two searches in PAIS, two in Social Sciences Full Text, and three in each of the other six databases.

The three universities were chosen due to their status as major centers for research in demography and related fields. All three are members of APLIC, the association for population and family planning libraries, and two of the three are ranked among the top five U.S. graduate programs in the sociology of population. Both Michigan and Princeton are supported by the National Institute on Aging as Centers on the Demography and Economics of Aging.33

Three demography librarians participated in the project as expert searchers: Joann Donatiello, Population Research Librarian at the Donald E. Stokes Library of Princeton University; Tara Murray, Information Core Director at the Population Research Institute of Penn State University; and Darlene Nichols, Coordinator of Graduate Library Instruction and Subject Librarian for Demography at the University of Michigan. Each was sent the following information:

[This project] will assess the effectiveness of various bibliographic databases within the area of later-life migration. The study will compare the effectiveness of simple and expert searches in each of [several] databases. . . . Within each database, I’d like you to retrieve bibliographic records that represent the most important journal articles on U.S. and Canadian later-life migration published between January 1990 and December 2000 (inclusive). Ideally, the search results will include a high proportion of relevant articles and a low proportion of non-relevant articles.

The instructions also made it clear that international migration, other than migration between the U.S. and Canada, was outside the scope of the search, and that documents other than journal articles would not be considered relevant. The searchers were advised to use print or online subject thesauruses, if they desired; to make full use of any helpful functions or features available through the search interfaces; and to perform as many preliminary searches as they liked. They were asked not to consult bibliographies or literature reviews, however, and to spend no more time on each search than the maximum amount of time they’d devote to helping a faculty member or student. The searchers were not told how relevance would be assessed and were not aware that a list of relevant documents had been prepared in advance.

For each database, the searchers were asked to submit detailed information “on the one search that best retrieves the most important journal articles on U.S. and Canadian later-life migration published between January 1990 and December 2000.” Because each search was tailored to the characteristics of a particular database, each searcher/database combination was unique. Table 3 shows examples of the expert searches performed in GS, PubMed, SSCI, and Sociological Abstracts.34

Table 3
Examples of Expert Searches

Google Scholar
Interface: http://scholar.google.com/advanced_scholar_search?hl=en&as_sdt=2000
With all of the words: (aged OR elderly OR “senior citizen” OR retire OR “late life”) (“geographic mobility” OR “relocation” OR “residential mobility” OR migration) (US OR “United States” OR canada OR sunbelt OR arizona OR florida OR minnesota)
Where my words occur: anywhere in the article
Search only articles in the following subject areas: Business, Administration, Finance, and Economics; Social Sciences, Arts, and Humanities
Limited by date, 1990 to 2000
About 40,800 hits

PubMed
Interface: http://www.ncbi.nlm.nih.gov/pubmed/advanced
Search: (“elderly migration”[All Fields] OR “elderly migration patterns”[All Fields] OR “elderly migration rates”[All Fields] OR “retirement migration”[All Fields] OR “retirement migration decision making”[All Fields] OR “retirement migration patterns”[All Fields] OR “retirement migration process”[All Fields]) NOT (Europe or Paris or London or Portugal or British or waste) AND (“1990/01/01”[PDAT] : “2000/12/31”[PDAT])
22 hits

Social Sciences Citation Index
Interface: Web of Science Search
Search: (Topic=(“baby boomer” or elderly or aged or “senior citizen” or retirement or retiree* or retired or “late life” or “later-life” or “old age” or “older adults” or “later years”)) AND (Topic=(“geographic mobility” or “residential mobility” or migration or relocation or resettlement or “population mobility” or flight)) AND (Topic=(“united states” or canada or “u.s.a.” or minnesota or arizona or florida or “new mexico” or southwest or “sun belt”))
Limited by date, 1990 to 2000
Journal articles only
Excluding these subject areas: Clinical neurology; Medicine, general & internal; Nursing; Peripheral vascular disease; Substance abuse
75 hits

Sociological Abstracts
Interface: CSA Illumina Advanced Search
Search: (DE=(“internal migration” or “rural to urban migration” or “urban to rural migration”) or TI=(“internal migration” or “geographic mobility”)) and (DE=(retirement or elderly or aging or “middle aged adults”) or TI=(aging or elderly or retire* or “late life” or “later life”))
Limited by date, 1990 to 2000
Journal articles only
19 hits

The expert searches submitted by the three demography librarians were replicated and evaluated by the author in June 2010. Each set of results was sorted by relevance, if possible, and by publication date otherwise.

Evaluation of Search Results: Recall and Precision

Each set of search results was evaluated in terms of both recall and precision. Recall is the effectiveness of the search in retrieving relevant results. Specifically, it is the number of relevant items retrieved as a proportion of the 155 relevant items. Recall is based on the total number of relevant articles—not on the number present within any particular database. Recall can be calculated for the set of all search results, or for a particular number of hits. For example, recall at 10, or R(10), is the proportion of the 155 relevant documents that were retrieved within the first ten hits. It has a maximum value (in this case) of 10/155, or 6.5 percent. Likewise, recall at 30, or R(30), is the proportion of the documents that were retrieved within the first thirty hits, with a maximum value (in this case) of 30/155, or 19.4 percent. The overall recall rate is the most appropriate measure of recall if we assume that users will examine every search result. However, R(30) (for example) is more appropriate if we assume that users will examine just the first thirty search results. In this study, eight recall values are reported for each search: R(10), R(20), R(30), R(40), R(50), R(75), R(100), and overall recall.35

Unlike recall, precision accounts for both the retrieval of relevant documents and the exclusion of non-relevant documents. Precision is the number of relevant items retrieved as a proportion of all the items retrieved. For precision at 10, or P(10), the denominator is 10; for precision at 20, or P(20), the denominator is 20. For overall precision, the denominator is the total number of search results, up to 300.
Eight precision values are presented in this study: P(10), P(20), P(30), P(40), P(50), P(75), P(100), and overall precision.

Table 4
Recall of Simple Searches in Google Scholar and Eight Other Databases

[The table’s cell values are not legible in this copy. Its columns are Database, N (a), R(10), R(20), R(30), R(40), R(50), R(75), R(100), Overall recall, and Coverage (b); its rows, in rank order, are GS, PubMed, SSCI, GEOBASE, SSFT, SocAbs, EconLit, PAIS, and POPLINE, followed by the percentile rank of the GS search in each column (c).]

a. Total number of records retrieved by the search.
b. Percentage of the 155 relevant records that are indexed in the database. Based on author and title searches for every relevant record. From Walters, “Google Scholar Coverage”; Walters and Wilder, “Bibliographic Index Coverage”; and subsequent analyses.
c. Percentile rank of the GS search.
Note. The searches (databases) are listed in rank order. The overall rank of each database was determined by first computing a percentile rank for the database in every column in which a value was present (as shown for GS), then averaging the eight percentile ranks. In this table, GS has an average rank in the 77th percentile, which puts it in first place overall.
Note. As noted in Table 1, the POPLINE, PubMed, and SSFT results were sorted by date (most recent first) rather than by relevance.
Note. The overall recall value for the GS search is based on the first 300 hits. The R(20) value for the PubMed search is based on all records retrieved—18 records rather than 20. The R(100) value for the GEOBASE search is based on all records retrieved—96 records rather than 100.

The recall and precision calculations required an assessment of every search result, within the first 300 hits, for each of the 32 searches. In each case, the question was simply “Is this document relevant? Can it be found within the set of relevant articles?”36

Results

Simple Searches

GS retrieved 30,200 records in response to the simple search phrase elderly migration. (Only the first 1,000 hits can be viewed in GS, however, and only the first 300 are examined here.) In contrast, no other database had more than 265 hits, and two had fewer than 20. GS is remarkable for the sheer number of records it retrieves.

As Table 4 shows, GS also retrieves many relevant records. In terms of overall recall, GS is far superior to the other databases, retrieving forty percent of the 155 relevant records. Only one other database, SSCI, has an overall recall rate greater than 17 percent. Of course many database users will examine only the records that appear early in the list of search results.37 Measures such as R(20) and R(30) may therefore be more meaningful than overall recall. Table 4 reveals that GS has above-average recall regardless of the number of search results evaluated. Specifically, GS appears in third place when we examine the first ten or twenty results; in second place when we examine the first thirty or forty results; and in first place when we examine 50, 75, or 100 results. Figure 1 presents additional detail, showing that for simple searches, SSCI is the only serious competitor to GS in terms of recall. Although PubMed has higher R(10) and R(20) values, its overall recall is low because it returns so few search results.

Figure 1. Recall of Simple Searches in Google Scholar and Eight Other Databases—First 200 Search Results.

To some extent, these differences in recall can be attributed to the fact that some databases cover (index) more of the relevant literature than others. The last column of Table 4, "Coverage," shows the percentage of the 155 relevant articles that can be found through title and author searches within each database. For example, GS covers 144 of the 155 relevant articles (93 percent) but retrieves just 62 (forty percent) through a simple keyword search for elderly migration.

Table 5. Precision of Simple Searches in Google Scholar and Eight Other Databases

[Table body not recoverable from this copy. Columns: P(10), P(20), P(30), P(40), P(50), P(75), P(100), overall precision, and N (total number of records retrieved by the search).]

Note. The searches (databases) are listed in rank order. The overall rank of each database was determined by first computing a percentile rank for the database in every column in which a value was present, then averaging the eight percentile ranks. In this table, GS has an average rank in the 66th percentile, which puts it in third place overall.
Note. As noted in Table 1, the POPLINE, PubMed, and SSFT results were sorted by date (most recent first) rather than by relevance.
Note. The overall precision value for the GS search is based on the first 300 hits. The P(20) value for the PubMed search is based on all 18 records retrieved, rather than 20. The P(100) value for the GEOBASE search is based on all 96 records retrieved, rather than 100.
SSCI covers 73 percent of the relevant articles but retrieves just 28 percent. There is a 0.89 correlation (Pearson's r) between the Coverage values and the Overall Recall values, leading us to conclude that differences in coverage account for 79 percent (r²) of the variation in overall recall. The intermediate measures of recall—R(10) through R(100)—are not so strongly associated with database coverage, however. The correlations between Coverage and the R(n) values range from 0.17 for R(20) to 0.80 for R(100), with an average value of 0.57 and a median value of 0.64. This indicates that differences in coverage typically account for about one-third of the variation in recall among the nine databases. The remainder of the variation in recall can be attributed to other factors such as the appropriateness of the search phrases and the effectiveness of the databases' retrieval mechanisms. Although Google Scholar's overall recall is very high, its overall precision is relatively low: 21 percent (Table 5). This can be attributed to the fact that GS retrieves many more records than the other databases. As mentioned earlier, Google Scholar's overall precision is based on the first 300 records; in contrast, PubMed's far higher overall precision (83 percent) is based on just 18 search results. Because PubMed retrieves so few records, a more appropriate indicator for comparing GS and PubMed is P(10). PubMed has a higher P(10) score than GS, but the difference in P(10) values is not nearly as great as the difference in overall precision. The P(10) through P(100) statistics in Table 5 reveal that GS has relatively high precision within the first 100 search results. Specifically, GS appears in third place when we examine the first ten or twenty results; in second place when we examine the first thirty, forty, or fifty results; and in first place when we examine the first 75 or 100 results.
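The coverage-recall association reported above can be sketched as a Pearson correlation, with r squared read as the share of variance explained. The paired values below are hypothetical stand-ins, not the article's data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

coverage = [90, 75, 55, 40, 25]   # hypothetical coverage rates (%)
recall = [38, 26, 18, 14, 6]      # hypothetical overall recall rates (%)

r = pearson_r(coverage, recall)
variance_explained = r ** 2       # the r-squared interpretation used in the text
```

With these made-up values the correlation is close to 1, so nearly all of the variation in recall would be "explained" by coverage; the article's figure of r = 0.89 corresponds to about 79 percent.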
Both GS and SSCI are unusual in that they maintain moderately high precision even after the first fifty hits (Figure 2). For example, the P(75) values of GS and SSCI (48 percent and 41 percent) are substantially higher than those of any other database. In contrast, the search results for Sociological Abstracts display a more typical pattern; 8 of the first ten hits are relevant, but precision declines rapidly after that point. Three databases—POPLINE, PubMed, and Social Sciences Full Text—maintain or improve their precision as the number of hits increases, but only because the results for those databases were sorted by date rather than by relevance. As Figures 1 and 2 demonstrate, GS performs especially well after the first 25 hits; when we examine 25 or more search results, GS consistently ranks in first or second place for both recall and precision. As noted in previous research, GS might be more useful to students and scholars if the most relevant results were concentrated within the first twenty hits rather than the first thirty or 50.38 The findings reported here lend further support to the idea that Google Scholar's mechanism for sorting results is less effective

Figure 2. Precision of Simple Searches in Google Scholar and Eight Other Databases—First 200 Search Results.

than its retrieval mechanism. Nonetheless, simple searches in GS have higher recall and precision than those in most of the other eight databases. The results of this analysis confirm that “the idiosyncrasies of Google Scholar’s search mechanism…do not com- promise its ability to retrieve relevant results in response to simple keyword searches.”39

Figure 3. Recall of Expert Searches in Google Scholar and Eight Other Databases—First 200 Search Results.

Table 6. Recall of Expert Searches in Google Scholar and Eight Other Databases

[Table body not recoverable from this copy. Columns: R(10), R(20), R(30), R(40), R(50), R(75), R(100), overall recall, and N (total number of records retrieved by the search); the final rows give the percentile ranks of the three GS expert searches.]

Note. The searches are shown in rank order. The numbers after the database names indicate the first, second, and third most effective searches in each database. The GS simple search is also listed, for comparative purposes. (See Table 4 for details.)
Note. As noted in Table 1, the POPLINE, PubMed, and SSFT results were sorted by date (most recent first) rather than by relevance.
Note. The overall recall values for the GS searches are based on the first 300 hits. The R(40) value for SocAbs1 is based on all 39 records retrieved, rather than 40. The R(20) value for SocAbs3 is based on all 19 records retrieved, rather than 20. The R(10) value for the GEOBASE search is based on all 9 records retrieved, rather than 10.

Expert Searches

Table 6 presents recall statistics for the 23 expert searches. As noted earlier, two or three searches, representing the University of Michigan, Penn State University, and Princeton University, were undertaken in every database except GEOBASE. Two of the three GS expert searches have high overall recall rates, and GS retrieves a greater number of relevant results than any other database (Table 6). This reflects both (a) the number of hits generated by a typical GS search, and (b) the fact that GS continues to retrieve relevant records well after the first fifty hits. As Figure 3 illustrates, this gives GS a significant advantage over the other databases. Only eleven of the 23 expert searches returned more than thirty hits, and only two databases—GS and Social Sciences Full Text—returned more than thirty hits in response to every expert search. For users willing to examine more than fifty search results, GS has substantially higher recall than the other eight databases.

Figure 4. Recall of Expert Searches in Google Scholar and Eight Other Databases—First 50 Search Results.

At the same time, GS has comparatively lower recall early in the list of search results. Among the fifteen expert searches with twenty or more hits, the three GS searches rank fifth, fifth, and twelfth in terms of R(20). Among the eleven searches with thirty or more hits, the three GS searches rank fourth, seventh, and ninth in terms of R(30). Figure 4 presents recall data for the first fifty hits.

These results suggest that the nine databases can be placed into three categories on the basis of their R(10) through R(50) values: (1) those with relatively high recall across all expert searches: Sociological Abstracts and SSCI; (2) those with moderate recall or inconsistent search performance: GS, EconLit, PAIS, PubMed, and Social Sciences Full Text; (3) those with relatively low recall across all expert searches: GEOBASE and POPLINE. GS falls into the middle category, with recall values that are consistent and moderate. In comparison with the other eight databases, GS has average recall up to R(50) and excellent recall thereafter.

Figure 5. Precision of Expert Searches in Google Scholar and Eight Other Databases—First 200 Search Results.

The comparative precision of the 23 expert searches can be seen in Table 7 and Figure 5. As noted earlier for simple searches, GS and SSCI are the only databases that maintain high precision after the first fifty search results. The most effective expert searches in GS and SSCI demonstrate that even long lists of search results can still include a high proportion of relevant results. For instance, the best GS expert search has a P(100) value of forty percent.

Table 7. Precision of Expert Searches in Google Scholar and Eight Other Databases

[Table body not recoverable from this copy. Columns: P(10), P(20), P(30), P(40), P(50), P(75), P(100), overall precision, and N (total number of records retrieved by the search); the final rows give the percentile ranks of the three GS expert searches.]

Note. The searches are shown in rank order. The numbers after the database names indicate the first, second, and third most effective searches in each database. The GS simple search is also listed, for comparative purposes. (See Table 5 for details.)
Note. As noted in Table 1, the POPLINE, PubMed, and SSFT results were sorted by date (most recent first) rather than by relevance.
Note. The overall precision values for the GS searches are based on the first 300 hits. The P(40) value for SocAbs2 is based on all 39 records retrieved, rather than 40. The P(20) value for SocAbs3 is based on all 19 records retrieved, rather than 20. The P(10) value for the GEOBASE search is based on all 9 records retrieved, rather than 10.

Figure 6. Precision of Expert Searches in Google Scholar and Eight Other Databases—First 50 Search Results.

Within the first fifty search results, two databases—Sociological Abstracts and PAIS—have consistently high precision (Table 7 and Figure 6). Only one of the three expert searches in Sociological Abstracts retrieved more than 19 records, however, and both of the PAIS searches retrieved fewer than 17. This suggests that expert searches in Sociological Abstracts and PAIS tend to produce results lists that are on-target but short. Most of the other databases, including GS, are notable for their inconsistency, since their precision varies substantially with each search. For instance, the three PubMed searches have P(20) values of twenty percent, sixty percent, and eighty percent. Likewise, the three GS searches have P(30) values ranging from 30 percent to 57 percent. Overall, we can conclude that certain databases (Sociological Abstracts and PAIS) have the potential for high precision within the first fifty results, and that others (GS and SSCI) have the potential for relatively high precision after the first fifty hits.

Simple Versus Expert Searches

So far, we have compared each database with the others. In contrast, this section compares simple and expert searches within each database. For example, the simple searches in GS are compared with the expert searches in GS—not with the searches in EconLit or SSCI. In EconLit, PAIS, POPLINE, and Social Sciences Full Text, expert searches are more effective than simple searches (Tables 4–7). Expert searches provide only a modest advantage in terms of recall, however. In EconLit, for instance, the R(20) value for the

expert searches is just one percentage point higher than the R(20) value for the simple search. The real advantage of these expert searches lies in their superior precision, especially early in the list of results. For EconLit, PAIS, POPLINE, and Social Sciences Full Text, the best expert search in each database has a P(10) value at least twice that of the corresponding simple search. Moreover, the higher precision of expert searches is consistent across all the searches in each of these databases. In one database, SSCI, the choice of a simple or expert search has only a negligible impact on the search results. As Tables 4–7 reveal, the results for the SSCI simple search are much like those for the three expert searches. Most of the SSCI searches are equally effective, and no one search, simple or expert, is clearly superior to the others. The situation for Google Scholar is substantially different. In terms of both recall and precision, the GS simple search is more effective than any of the three GS expert searches. (Tables 6 and 7 show the results of the GS simple searches along with the results of the expert searches.) This finding may not be surprising for Google Scholar, a database that is still widely criticized for its unsophisticated interface and its idiosyncratic search behavior.40 However, three other databases display much the same pattern, with simple searches that are superior to the expert searches in terms of both recall and precision. This can be seen most clearly in Figures 7–14, which present recall and precision data for GS, GEOBASE, PubMed, and Sociological Abstracts. In GS and GEOBASE, the superiority of the simple searches is readily apparent. In PubMed, the simple search has higher recall and precision but returns fewer results.
In Sociological Abstracts, the simple search has higher recall and precision than two of the three expert searches. Together, these findings can be used to identify three groups of databases. Group 1 databases (EconLit, PAIS, POPLINE, and SSFT) are those for which expert searches are generally more effective than simple searches. Group 2 databases, here represented by SSCI, are those for which simple and expert searches are comparable in recall and precision. Group 3 databases (GS, GEOBASE, PubMed, and Sociological Abstracts) are those for which simple searches are generally more effective than expert searches.

The superiority of simple searches within the Group 3 databases requires further investigation. After all, the simple search phrase (elderly migration) was intended to mimic the behavior of an inexperienced undergraduate with no special knowledge of the subject or the databases. In contrast, the expert searches were constructed by demography librarians at three top universities—individuals with knowledge of the subject area, the databases, and effective searching techniques. Moreover, the expert searchers were encouraged to consult subject thesauruses, conduct preliminary searches, examine their initial search results, and refine their search strategies before choosing the one search that worked best in each database.

Table 8 summarizes the characteristics of the searches conducted in each database, allowing us to compare the Group 1 databases (those for which expert searches are more effective) with the Group 3 databases (those for which simple searches are more effective). Looking first at the recall and precision data, we can see that the two groups differ

Figure 7. Recall of Simple and Expert Searches in Google Scholar—First 200 Search Results.

Figure 8. Precision of Simple and Expert Searches in Google Scholar—First 200 Search Results.

Figure 9. Recall of Simple and Expert Searches in GEOBASE—First 200 Search Results.

Figure 10. Precision of Simple and Expert Searches in GEOBASE—First 200 Search Results.

Figure 11. Recall of Simple and Expert Searches in PubMed— First 200 Search Results. The dark line represents both the simple search and one of the three expert searches.

Figure 12. Precision of Simple and Expert Searches in PubMed—First 200 Search Results.

Figure 13. Recall of Simple and Expert Searches in Sociological Abstracts—First 200 Search Results. The dot represents an expert search.

Figure 14. Precision of Simple and Expert Searches in Sociological Abstracts—First 200 Search Results. The dot represents an expert search.

chiefly in their simple-search performance. For simple searches, the Group 3 databases are superior in terms of both recall and precision. Specifically, (1) each of the Group 1 simple searches has especially low recall or precision; (2) two of the Group 3 simple searches have especially high recall or precision, and none have especially low recall or precision. In contrast, there are no major differences in inter-group recall or precision when expert searches are compared. The distinguishing characteristic of Group 3 is not that the expert searches are ineffective, but that the simple searches are especially effective.

Table 8. Characteristics of Searches and Databases in Group 1 and Group 3

[Table body not recoverable from this copy. For each database in Group 1 (EconLit, PAIS, POPLINE, SSFT) and Group 3 (GS, GEOBASE, PubMed, SocAbs), the table reports database characteristics (use of controlled vocabulary, sorting of results by relevance, and coverage); the recall and precision of the simple searches; and, for the expert searches, the average length of the search string, recall, precision, and the percentage of expert searches that used subject descriptors, geographical descriptors, title-field searching, exact-phrase searching, truncation of words/phrases, and a "journal articles only" or "peer-reviewed only" limiter.]

Note. SSCI is not included in either Group, since simple and expert searches are equally effective in SSCI. All the expert searches made use of Boolean operators and date limits.
Note. Coverage is the percentage of the 155 relevant records that are indexed in the database, based on author and title searches for every relevant record; from Walters, "Google Scholar Coverage"; Walters and Wilder, "Bibliographic Index Coverage"; and subsequent analyses.

Table 8 reveals a number of additional inter-group differences, two of which seem noteworthy.41 In comparison with the Group 1 databases, the Group 3 databases are fifty percent more likely to offer the option of sorting the results by relevance. Moreover, they have an average coverage rate 58 percent higher than that of the Group 1 databases. As noted earlier, coverage represents the proportion of the 155 relevant articles that are indexed in the database and therefore potentially retrievable through keyword and subject searches. High coverage rates and effective relevance sorting help explain why the Group 3 databases have especially good simple-search performance, since both thorough coverage and effective sorting make the quality of the search results less dependent on the quality of the search itself. For example, Google Scholar's 93 percent coverage rate increases the chance that even poorly designed search strategies will achieve reasonably good results. In contrast, the 23 percent coverage rate of EconLit, the twelve percent coverage rate of PAIS, and the absence of relevance sorting in POPLINE and Social Sciences Full Text virtually ensure that poorly designed search strategies will lead to inferior results. High coverage rates and effective relevance sorting can help us understand why simple searches might be just as effective as expert searches in GS, GEOBASE, PubMed, and Sociological Abstracts. Unfortunately, neither factor suggests why simple searches might be more effective than expert searches.

Discussion

Major Findings

This study confirms that within the field of later-life migration, simple searches in GS have high precision and high recall relative to simple searches in EconLit, GEOBASE, PAIS, POPLINE, PubMed, SSCI, Social Sciences Full Text, and Sociological Abstracts. However, Google Scholar's advantage diminishes considerably when expert searches are considered. For expert searches, Google Scholar's performance is no better or worse than average within the first fifty search results. On the other hand, GS is one of the few databases to continue retrieving relevant results with reasonably high precision after the fiftieth hit. The comparative effectiveness of GS therefore depends on the number of search results that users are willing to examine.

Of the nine databases evaluated here, Sociological Abstracts and SSCI are the only ones with consistently high expert-search recall across the entire range of search results. Sociological Abstracts and PAIS have the highest expert-search precision within the first fifty results, although GS and SSCI have the highest precision after the fiftieth hit. At the other end of the spectrum, expert searches in GEOBASE and POPLINE are notable for their low recall and precision.

Perhaps surprisingly, the simple searches in GS, GEOBASE, PubMed, and Sociological Abstracts had consistently higher recall and precision than the expert searches. That is, simple keyword searches were more effective than the searches carefully constructed by demography librarians at three major research universities. Moreover, the expert searches in GEOBASE, PubMed, and Sociological Abstracts made full use of controlled-vocabulary subject terms, Boolean operators, geographical descriptors, title-field searching, exact-phrase searching, truncation of words and phrases, and "journal articles only"/"peer-reviewed only" limiters. As Figures 7–14 reveal, neither advanced search features nor the searchers' knowledge and experience made expert searches any more effective than simple keyword searches.

How can simple searches be more effective than expert searches? Differences in coverage rates appear to offer a partial explanation, since the databases with superior simple-search performance are the same ones that cover (index) a high proportion of the relevant literature. For these databases, the quality of the search results may be less closely linked to the effectiveness of the search strategy, since even poorly constructed searches in GS, for example, draw upon the large pool of relevant documents available within the database. At best, however, this explanation suggests only why simple and expert searches might be equally effective; it cannot account for those cases in which simple searches are more effective than expert searches.

Implications for Practice

Google Scholar is a valuable tool for serious research. It is notable both for its effectiveness and for the unconventional methods used to build the database. GS should therefore be regarded not as just another search tool, but as one that provides unique opportunities for instruction about the ways in which the scholarly literature is indexed and made available.

As Figures 7–14 reveal, simple searches can sometimes be more effective than expert searches. The reasons for this are uncertain, although both formal studies and informal investigations within particular libraries can help determine when simple searches are especially appropriate. First, however, we need to accept the idea that advanced searches are not always superior, even for in-depth research.

These findings also demonstrate that database users should be willing to examine more than the first few pages of search results. In both GS and SSCI, precision—the likelihood that any particular result will be relevant—declines only slightly, or not at all, from the thirtieth record to the 100th. The most effective searches for the literature of later-life migration have greater than forty percent precision even after the 100th hit.

More generally, it is important to understand the extent to which conventional databases and interfaces conform (or do not conform) to librarians' conceptual models of the ideal bibliographic database. Critics who point to the flaws of GS—the absence of a controlled vocabulary, for instance—sometimes overlook the fact that several conventional databases also lack a controlled vocabulary for subject terms. Likewise, unusual

interfaces and search procedures are by no means limited to Google Scholar. Many search interfaces leave users guessing which fields are included in a keyword search, and most multidisciplinary databases lack broad subject limiters of the kind provided by GS. While most databases use an implied AND between adjacent words, a significant number do not. POPLINE even requires an ampersand—not AND—as a Boolean operator. As this study demonstrates, there is room for improvement across the entire range of bibliographic databases.

Finally, we may benefit from a more explicit recognition of the multiple functions of bibliographic databases. Students and scholars commonly use these databases for three distinct purposes: (1) to identify relevant documents on a particular subject; (2) to find or verify bibliographic information; (3) to gain access to full-text documents. Traditionally, these three processes (steps) have been linked in a particular order. The identification of relevant documents led immediately to the verification of bibliographic information, which was then used to access the documents, often through a second database such as a library catalog. Recently, however, it has become possible to proceed directly from step 1 to step 3 by using the full-text links associated with particular database records. Likewise, users can proceed from step 2 to step 3 even without complete bibliographic information—sometimes with just a keyword-searchable sentence found within the text of an article. Arguably, step 2 is not an end in itself, but a process that contributes to step 1 (by providing information that is helpful in assessing relevance) and to step 3 (by facilitating the retrieval of previously identified documents).
If complete and accurate bibliographic information is no longer needed to gain access to relevant documents, can a database be truly effective even without it? This question is especially meaningful in the case of Google Scholar, since GS is very good at identifying relevant documents but not very good at providing complete, reliable metadata.42

Further Research

The results of this study are limited by its emphasis on a single subject area. Similar investigations involving a broader range of subject areas might allow us to assess the reliability of these findings across disciplines. A more naturalistic search process might also be helpful. After all, the searches constructed for this investigation were based on a set of written instructions rather than a series of reference interviews. Consequently, the searchers did not have access to the feedback that patrons would normally provide during the search process and the examination of results.

Although high coverage rates and effective sorting mechanisms may help account for the effectiveness of simple searches in the Group 3 databases, neither factor can explain why simple searches might be more effective than expert searches. Because the expert searches used in this study were different from the simple searches in multiple respects, we cannot identify which particular factors were responsible for the effectiveness of each search. Further research may allow us to address this question by examining how search performance is influenced by each of several factors:

(1) knowledge of the search topic (domain knowledge);43
(2) knowledge of search options and strategies (professional knowledge);
(3) knowledge and experience related to particular databases;
(4) availability or use of the full range of Boolean operators;
(5) availability or use of controlled subject vocabulary and subject thesauruses;
(6) availability or use of other advanced search features such as geographical descriptors and document-type limiters.

Although the results presented here are useful for assessing the utility of each database, they do not provide conclusive evidence that any particular search option or feature is essential for effective searching.

Acknowledgements

I am grateful for the contributions of Cheryl Collins, Joann Donatiello, Tara Murray, Darlene Nichols, Carol Wright, and three anonymous referees.

William H. Walters is Dean of Library Services and Associate Professor of Social Sciences, Menlo College, Atherton, CA; he may be contacted via e-mail at: [email protected].

Notes

1. Roger C. Schonfeld and Ross Housewright, Faculty Survey 2009: Key Strategic Insights for Libraries, Publishers, and Societies (New York: Ithaka, 2010), http://www.ithaka.org/ithaka-s-r/research/faculty-surveys-2000-2009/Faculty%20Study%202009.pdf (accessed May 25, 2011).

2. Laura Bowering Mullen and Karen A. Hartman, “Google Scholar and the Library Web Site: The Early Response by ARL Libraries,” College & Research Libraries 67, 2 (2006): 106–122; Joan Giglierano, “Attitudes of OhioLINK Librarians Toward Google Scholar,” Journal of Library Administration 47, 1/2 (2008): 101–113.

3. Karen A. Hartman and Laura Bowering Mullen, “Google Scholar and Academic Libraries: An Update,” New Library World 109, 5/6 (2008): 211–222; Chris Neuhaus, Ellen Neuhaus, and Alan Asher, “Google Scholar Goes to School: The Presence of Google Scholar on College and University Web Sites,” Journal of Academic Librarianship 34, 1 (2008): 39–51.

4. Google, “About Google Scholar” (2011), http://scholar.google.com/intl/en/scholar/about.html (accessed May 25, 2011); Mick O’Leary, “Google Scholar: What’s in It for You?” Information Today 22, 7 (2005): 35–39.

5. Dean Giustini and Eugene Barsky, “A Look at Google Scholar, PubMed, and Scirus: Comparisons and Recommendations,” Journal of the Canadian Health Libraries Association 26, 3 (2005): 85–89; Chuck Hamaker and Brad Spry, “Key Issue: Google Scholar,” Serials 18, 1 (2005): 70–72; Péter Jacsó, “Péter’s Picks & Pans: CiteBaseSearch, Institute of Physics Archive, and Google’s Index to Scholarly Archive,” Online 28, 5 (2004): 57–60; Péter Jacsó, “Google Scholar: The Pros and the Cons,” Online Information Review 29, 2 (2005): 208–214; Péter Jacsó, “Visualizing Overlap and Rank Differences Among Web-Wide Search Engines: Some Free Tools and Services,” Online Information Review 29, 5 (2005): 554–560; Greg R. Notess, “Scholarly Web Searching: Google Scholar and Scirus,” Online 29, 4 (2005): 39–41.

6. William H. Walters, “Google Scholar Coverage of a Multidisciplinary Field,” Information Processing and Management 43, 4 (2007): 1121–1132; William H. Walters, “Google Scholar Search Performance: Comparative Recall and Precision,” portal: Libraries and the Academy 9, 1 (2009): 5–24.

7. Walters, “Google Scholar Search Performance.”

8. Janice Adlington and Chris Benda, “Checking Under the Hood: Evaluating Google Scholar for Reference Use,” Internet Reference Services Quarterly 10, 3/4 (2005): 135–148; Rebecca Donlan and Rachel Cooke, “Running with the Devil: Accessing Library-Licensed Full-Text Holdings Through Google Scholar,” Internet Reference Services Quarterly 10, 3/4 (2005): 149–157; Péter Jacsó, “As We May Search: Comparison of Major Features of the Web of Science, Scopus, and Google Scholar Citation-Based and Citation-Enhanced Databases,” Current Science 89, 9 (2005): 1537–1547; Martin Myhill, “Google Scholar,” The Charleston Advisor 6, 4 (2005): 49–52; Marydee Ojala, “Scholarly Mistakes,” Online 29, 3 (2005): 26; Roy Tennant, “Google, the Naked Emperor,” Library Journal 130, 13 (2005): 29.

9. Joann M. Wleklinski, “Studying Google Scholar: Wall to Wall Coverage?” Online 29, 3 (2005): 22–26; Susan Gardner and Susanna Eng, “Gaga Over Google? Scholar in the Social Sciences,” Library Hi Tech News 22, 8 (2005): 43; Burton Callicott and Debbie Vaughn, “Google Scholar vs. Library Scholar: Testing the Performance of Schoogle,” Internet Reference Services Quarterly 10, 3/4 (2005): 71–88.

10. Marilyn Christianson, “Ecology Articles in Google Scholar: Levels of Access to Articles in Core Journals,” Issues in Science and Technology Librarianship 49 (2007), http://www.istl.org/07-winter/refereed.html (accessed May 25, 2011); Miguel A. García-Pérez, “Accuracy and Completeness of Publication and Citation Records in the Web of Science, PsycINFO, and Google Scholar: A Case Study for the Computation of h Indices in Psychology,” Journal of the American Society for Information Science and Technology 61, 10 (2010): 2070–2085; Péter Jacsó, “Amazon, Google Book Search, and Google Scholar,” Online 32, 2 (2008): 51–54; Péter Jacsó, “Google Scholar Revisited,” Online Information Review 32, 1 (2008): 102–114; Péter Jacsó, “Metadata Mega Mess in Google Scholar,” Online Information Review 34, 1 (2010): 175–191; Margie Ruppel, “Google Scholar, Social Work Abstracts (EBSCO), and PsycINFO (EBSCO),” The Charleston Advisor 10, 3 (2009): 5–11.

11. Lydia Dixon et al., “Finding Articles and Journals via Google Scholar, Journal Portals, and Link Resolvers: Usability Study Results,” Reference & User Services Quarterly 50, 2 (2010): 170–181; Seikyung Jung et al., “LibraryFind: System Design and Usability Testing of Academic Metasearch System,” Journal of the American Society for Information Science and Technology 59, 3 (2008): 375–389.

12. Chris Neuhaus et al., “The Depth and Breadth of Google Scholar: An Empirical Study,” portal: Libraries and the Academy 6, 2 (2006): 127–141.

13. Christianson, “Ecology Articles in Google Scholar”; John J. Meier and Thomas W. Conkling, “Google Scholar’s Coverage of the Engineering Literature: An Empirical Study,” Journal of Academic Librarianship 34, 3 (2008): 196–201.

14. Philipp Mayr and Anne-Kathrin Walter, “An Exploratory Study of Google Scholar,” Online Information Review 31, 6 (2007): 814–830. Reprinted as “Studying Journal Coverage in Google Scholar,” Journal of Library Administration 47, 1/2 (2008): 81–99.

15. Walters, “Google Scholar Coverage.”

16. Lyle Ford and Lisa Hanson O’Hara, “It’s All Academic: Google Scholar, Scirus, and Windows Live Academic Search,” Journal of Library Administration 46, 3/4 (2008): 43–52.

17. Michael Norris, Charles Oppenheim, and Fytton Rowland, “Finding Open Access Articles Using Google, Google Scholar, OAIster and OpenDOAR,” Online Information Review 32, 6 (2008): 709–715.

18. Jared L. Howland et al., “Google Scholar and the Continuing Education Literature,” Journal of Continuing Higher Education 57, 1 (2009): 35–39.

19. Hannah Rozear, “Where Google Scholar Stands on Art: An Evaluation of Content Coverage in Online Databases,” Art Libraries Journal 34, 2 (2009): 21–25.

20. Susanne Mikki, “Comparing Google Scholar and ISI Web of Science for Earth Sciences,” Scientometrics 82, 2 (2010): 321–331.

21. Rebecca Griffiths, Michael Dawson, and Matthew Rascoff, Scholarly Communications in the History Discipline: A Report Commissioned by JSTOR (New York: Ithaka, 2006), http://www.ithaka.org/publications/History (accessed May 25, 2011).

22. Rena Helms-Park, Pavlina Radia, and Paul Stapleton, “A Preliminary Assessment of Google Scholar as a Source of EAP Students’ Research Materials,” Internet and Higher Education 10, 1 (2007): 65–76.

23. Jared L. Howland et al., “How Scholarly is Google Scholar? A Comparison to Library Databases,” College & Research Libraries 70, 3 (2009): 227–234.

24. D. Yvonne Jones, “Biology Article Retrieval from Various Databases: Making Good Choices with Limited Resources,” Issues in Science and Technology Librarianship 44 (2005), http://www.istl.org/05-fall/refereed.html (accessed May 25, 2011).

25. Mary L. Robinson and Judith Wusteman, “Putting Google Scholar to the Test: A Preliminary Study,” Program: Electronic Library and Information Systems 41, 1 (2007): 71–80. The study also included four non-scholarly searches for topics of general interest. In those cases, GS did not perform as well.

26. Glenn Haya, Else Nygren, and Wilhelm Widmark, “Metalib and Google Scholar: A User Study,” Online Information Review 31, 3 (2007): 365–375.

27. Michael Levine-Clark and Joseph Kraus, “Finding Chemistry Information Using Google Scholar: A Comparison with Chemical Abstracts Service,” Science & Technology Libraries 27, 4 (2007): 3–17.

28. Walters, “Google Scholar Search Performance.”

29. Per O. Seglen, “The Skewness of Science,” Journal of the American Society for Information Science 43, 9 (1992): 628–638.

30. William H. Walters, “Later-Life Migration in the United States: A Review of Recent Research,” Journal of Planning Literature 17, 1 (2002): 37–66. Reprinted in William H. Walters, “Types and Determinants of Later-Life Migration” (PhD diss., Brown University, 2002). The 155 relevant documents include all the works cited in the bibliography except for those published before 1990 (forty items); those published after 2000 (one item); those published as books, book chapters, or dissertations (26 additional items); and those that are primarily bibliographic or editorial in nature (ten items).

31. Although the nine simple searches are very similar to those presented in an earlier investigation (Walters, “Google Scholar Search Performance”), this study evaluates a different set of databases. Specifically, it examines PubMed rather than MEDLINE, Sociological Abstracts rather than SocINDEX, and Social Sciences Full Text rather than Social Sciences Abstracts. AgeLine, ArticleFirst, and EBSCO Academic Search were included in the earlier study but are not examined here. Although the hit counts reported here for the simple searches in GS, EconLit, and POPLINE are slightly different from those reported in “Google Scholar Search Performance,” the search phrase and the interface were the same in both studies. Any discrepancies are presumably due to changes in the databases that occurred between the two analyses—between February 2008 and June 2010.

32. Walters, “Google Scholar Coverage”; William H. Walters and Esther I. Wilder, “Bibliographic Index Coverage of a Multidisciplinary Field,” Journal of the American Society for Information Science and Technology 54, 14 (2003): 1305–1312; William H. Walters and Esther I. Wilder, “Disciplinary Perspectives on Later-Life Migration in the Core Journals of Social Gerontology,” The Gerontologist 43, 5 (2003): 758–760.

33. APLIC, “Membership List” (2009), http://www.aplici.org/membership/membership-list/ (accessed May 25, 2011); National Archive of Computerized Data on Aging, “NIA Funded Centers: Centers on the Demography of Aging” (2011), http://www.icpsr.umich.edu/icpsrweb/NACDA/nia-centers/ (accessed May 25, 2011); U.S. News & World Report, “Sociology Specialty Rankings: Sociology of Population” (2011), http://grad-schools.usnews.rankingsandreviews.com/best-graduate-schools/top-sociology-schools/sociology-of-population (accessed May 25, 2011).

34. As Table 3 suggests, this study assesses the performance of entire information systems, each of which has several components: people, data, software, hardware, and network resources. While an investigation of database effectiveness might use a single search phrase across several databases in order to isolate the impact of database choice on search effectiveness, the goal of this study is to evaluate information systems rather than databases. In this context, the variation in search strategies is not a confounding factor, but a reflection of the fact that each information system has different capabilities and characteristics.

35. As noted earlier, each of the 155 relevant records can be found in the bibliography of a published literature review: Walters, “Later-Life Migration.” Because some articles are indexed in GS solely due to their inclusion in the bibliographies of previously indexed papers, there is a possibility that the earlier paper (which was indexed in GS) might have itself led to an improvement in Google Scholar’s coverage of the 155 relevant articles. Further investigation reveals that any such bias is likely to be minimal, however. GS includes a special notation in records that were generated from the bibliographies of previously indexed papers, and only five relevant records from the four GS searches carry that notation. Each of the five was cited in at least seven GS-indexed articles, and four of the five were cited in at least four GS-indexed articles that appeared prior to “Later-Life Migration.” It is therefore unlikely that any of them were included in the database due to their appearance in that paper.

36. Several rules were established before the assessment process began. Records for working papers that were later published as relevant articles were not counted as relevant, since those records are missing publication information that may influence users’ assessments of relevance. Likewise, records for collections of articles (such as special issues of journals) that mention a relevant article were not counted as relevant. If a relevant article appeared more than once in the results for a particular search, only the first occurrence—the lowest-numbered hit—was counted.

37. Most users of internet search engines examine just a single page of results, although users of scholarly databases are likely to be somewhat more thorough. See, for example, Bernard J. Jansen and Amanda Spink, “How Are We Searching the World Wide Web? A Comparison of Nine Search Engine Transaction Logs,” Information Processing and Management 42, 1 (2006): 248–263.

38. Walters, “Google Scholar Search Performance.”

39. Walters, “Google Scholar Search Performance,” 10.

40. Jacsó, “Google Scholar Revisited”; Jacsó, “Metadata Mega Mess”; Ruppel, “Google Scholar, Social Work Abstracts (EBSCO), and PsycINFO (EBSCO).”

41. As Table 8 reveals, expert searches in the Group 3 databases are also characterized by relatively long search phrases; relatively infrequent use of “journal articles only” or “peer-reviewed only” limiters; and relatively frequent use of geographical area descriptors, title-field searching, and exact phrase searching. These differences do not help explain the relative effectiveness of simple searches in the Group 3 databases, however.

42. Christianson, “Ecology Articles in Google Scholar”; Jacsó, “Metadata Mega Mess.”

43. See, for example, Nicolas Vibert et al., “Effects of Domain Knowledge on Reference Search with the PubMed Database: An Experimental Study,” Journal of the American Society for Information Science and Technology 60, 7 (2009): 1423–1447.