
Procedia Computer Science 96 ( 2016 ) 169 – 178

20th International Conference on Knowledge Based and Intelligent Information and Engineering Systems, KES2016, 5-7 September 2016, York, United Kingdom

Evaluating the suitability of Web search engines as proxies for knowledge discovery from the Web

Laura Martínez-Sanahuja*, David Sánchez

UNESCO Chair in Data Privacy, Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili, Av. Països Catalans 26, 43007 Tarragona, Catalonia, Spain

Abstract

Many researchers use the Web search engines’ hit count as an estimator of the Web information distribution in a variety of knowledge-based (linguistic) tasks. Even though many studies have been conducted on the retrieval effectiveness of Web search engines for Web users, few of them have evaluated them as research tools. In this study we analyse the currently available search engines and evaluate the suitability and accuracy of the hit counts they provide as estimators of the frequency/probability of textual entities. From the results of this study, we identify the search engines best suited to be used in linguistic research.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of KES International.

Keywords: Web search engines; hit count; information distribution; knowledge discovery; semantic similarity; expert systems.

1. Introduction

Expert and knowledge-based systems rely on data to build the knowledge models required to perform inferences or answer questions. Yet, their performance is tied to the availability and coverage of the electronic data they use as knowledge sources. In this respect, the success of the Internet has multiplied the amount of electronic resources that are freely available for research. By exploiting these large data sources, it has been possible to build expert systems with a performance we could only imagine a few years ago. For instance, the well-known Watson expert system by IBM was able to win the Jeopardy! quiz show against expert human players by exploiting more than 200 million pages of linguistic electronic content, which included the whole of Wikipedia [1].

* Corresponding author. Tel.: +34 977559657; fax: +34 977559710. E-mail address: [email protected] (L. Martínez-Sanahuja), [email protected] (D. Sánchez)

doi: 10.1016/j.procs.2016.08.123

Many expert systems rely on linguistic data for knowledge discovery, especially those dealing with textual inputs/outputs. Indeed, most of the information being produced nowadays is textual, because text constitutes the natural means of interaction among human actors. In this respect, the Web is currently the largest freely available source of electronic data, most of which is of a linguistic nature. As we become increasingly aware of the importance of large amounts of linguistic data for the development of expert systems, the use of the Web as a knowledge source becomes ever more appealing. In fact, the Web is so large, heterogeneous and up-to-date that it is said to be a faithful representation of the current information distribution at a social scale [2], an argument supported by recent works [3-5], which considered the Web a realistic proxy for social knowledge.

Because of these interesting features, many researchers have used the Web as a knowledge source and, more specifically, to estimate the distribution of linguistic data from the frequency/probability of (co-)occurrence of entities of interest (e.g., textual terms, concepts, bigrams, etc.). In this way, researchers attempt to alleviate the constraints imposed by static linguistic corpora which, despite being reliable and unambiguous, are limited in terms of size, coverage and updates, thus usually producing data sparseness problems [5]. The usual low-entry-cost way to access Web data is via a commercial Web search engine (WSE). Indeed, frequencies or probabilities for some phenomenon of interest can be straightforwardly estimated from the hit count provided in the search engine's result page [6]. Early works using hit counts tackled the identification of translations of compositional phrases [7], the discovery of synonyms [8] or the assessment of frequencies of bigrams [9]. More recent works include building models of noun compound bracketing [10], automatic ontology learning [4, 11-13], large-scale information extraction [14, 15], semantic similarity estimation [2, 16, 17], topic discovery [18], user profiling [19] and disclosure risk assessment in textual documents [3, 5, 20].

When WSEs are used as proxies for the Web information distribution in tasks such as the former, the outcomes of these tasks closely depend on the suitability of the hit counts as frequency estimators. Yet, many researchers relegate the choice of the search engine and employ the WSE they are familiar with, thus potentially compromising their research results; in fact, the search engines most commonly used in research are also those most used in general: Google, Bing and Yahoo! [2, 5, 17, 21]. Some studies have criticized these choices and questioned the usefulness of well-known WSEs as research tools due to the issues they present (ambiguity, constrained query languages, commercial bias, arbitrariness of hit counts, etc.) [22]. The study we conduct in this paper aims at bringing some light to these issues and, specifically, to the following questions: are WSEs really effective as proxies for the Web information distribution? How far may the choice of a particular WSE influence the outcomes of the task to which it is applied? And, ultimately, which WSE is best suited for linguistic research?
For this purpose, we survey and systematically analyse most of the WSEs currently available (commercial or not) and select those able to provide hit counts that can be used as general-purpose estimators of term frequencies. The selected search engines are evaluated from both qualitative and quantitative perspectives. In the former case, we define a set of quality criteria that a WSE should fulfil in order to be considered an appropriate research tool (i.e., mathematical coherence of hit counts, flexibility of the query language, non-exact search capabilities and access restrictions). In the latter, application-oriented evaluation, we use the hit counts of the different WSEs in one of the central tasks of computational linguistics (i.e., the estimation of the semantic similarity between concepts) and objectively measure the accuracy of the outcomes they provide. From the results of these evaluations, we identify the search engine(s) best suited for linguistic research.

Many studies have evaluated the effectiveness of WSEs as information retrieval tools [23-27]. However, few works have analysed the suitability of WSEs' hit counts for linguistic research. In [28], the author measured the correlation between the hit counts provided by several well-known search engines for a set of queries, whereas in [29], the coherence of the hit counts was tested against the actual number of web sites indexed by the WSE; finally, the studies performed in [30, 31] focused on analysing the reliability of the hit counts through time. All the former works focused on the three most used WSEs: Google, Yahoo and Bing/Live Search. As main contributions over these works, our study provides a more up-to-date survey, considers a much broader spectrum of WSEs (going beyond the "usual suspects" Google, Bing and Yahoo) and, in addition to the application-agnostic analysis of WSEs' hit counts, provides an application-oriented evaluation in a core task of linguistic research: the estimation of the semantic similarity between textual terms. With this we aim not only at assessing the potential of WSEs' hit counts, but also at measuring their actual performance in a realistic research setting.

The rest of the paper is structured as follows. The next section details the WSEs we reviewed and explains the selection criteria for those that are evaluated later. Section 3 details the criteria and the results of the qualitative evaluation for the chosen WSEs. Section 4 depicts the metrics, benchmark and results of the quantitative application-oriented evaluation. Section 5 discusses the results of the evaluations and provides advice on choosing a WSE suitable for research. The final section presents the main conclusions and depicts some lines of future research.

2. Surveyed WSEs and selection criteria

Since our study aims to be general and domain-independent, we focus on WSEs supporting general-purpose searches. As a result, we omitted those constrained to a certain topic or content, such as medical search engines (like PubGene or GoPubMed) or educational ones (like BASE or Google Scholar). For a search engine to be considered in our study we set additional criteria that are desirable for linguistic research. First, the WSE should provide a standard search bar for introducing textual queries. Second, it should obviously provide the hit count associated with the performed query. Finally, it should be cross-language. In some cases, we identified WSEs that redirected to or were powered by search engines of other vendors. In such cases, once we checked that the results were indeed equivalent to those of the main vendor, we restricted our analysis to the latter. The search for WSEs was done in January 2016 and was based on Wikipedia articles and web surveys on search engines; a total of 58 WSEs were compiled and analysed. Table 1 lists the WSEs that did not fulfil some of the above criteria, whereas Table 2 lists the 13 WSEs that fulfilled all the criteria and were subject to evaluation.

Table 1. List of rejected WSEs.

Reason | Web search engine
No textual search bar | Alexa Internet; GrayMatter; joongle; Kosmix; Mahalo; Munax; Voila
Inactive | Yauba; Neuralcoder
Down for maintenance | NowRelevant
Hit count not provided | Ask.com; DuckDuckGo; Gyffu; info.com; ixquick; Mamma; WebCrawler; YaCy
No textual data, just photos | Specify
Arbitrary hit count | Exalead
No hit count for some queries | Trovator
Extremely low hit count | Scour; Zuula
No online version |
Redirects to Bing | MSN
Redirects to Google | iAlgae; Wopa!
Redirects to Yahoo! | Alltheweb; Altavista
Language: Chinese | Panguso
Language: Chinese, Japanese |
Language: French | Dazoo FR; LeMoteur; Premsgo
Language: Italian | Virgilio.it
Language: Korean |
Language: Swedish |

3. Qualitative evaluation

Because WSEs are aimed at standard Web users, it is understandable that they present some limitations when used for research. In the following, we discuss some of the research-oriented problems that have traditionally been attributed to WSEs [6] and evaluate whether current WSEs still suffer from them.


Table 2. Selected general-purpose WSEs.

WSE | Year | Brief overview
AOL Search | 2005 | AOL Inc. is an American mass media corporation that develops, grows, and invests in brands and web sites.
Bing | 2009 | Bing is a web search engine owned and operated by Microsoft. The service has its origins in Microsoft's previous search engines: MSN Search, Windows Live Search and later Live Search. As of February 2015 it is the second largest search engine in the US (after Google) with a 19.8% market share.
Ecosia | 2009 | Ecosia is a web search engine based in Berlin, Germany. Ecosia's search results are powered by Bing, but its actual hit counts differ from those of Bing.
Entireweb | 2000 | Entireweb.com is a search engine launched in 2000 by Entireweb Sweden AB.
Gibiru | 2009 | Gibiru provides uncensored and unpersonalised anonymous Web and news results. Gibiru is not partnered with the NSA, so users can browse the Internet without being tracked.
Gigablast | 2000 | Gigablast is a small independent web search engine based in New Mexico.
Google | 1998 | Google Search is the most-used search engine on the World Wide Web, handling more than three billion searches each day. As of February 2015 it is the most used search engine in the US with a 64.5% market share.
Hotbot | 1996 | HotBot is a web search engine currently owned by Lycos, Inc. It was launched in May 1996 by Wired magazine.
Lycos | 1994 | Lycos, Inc. is a search engine and web portal established in 1994, spun out of Carnegie Mellon University.
Mojeek | 2009 | Mojeek is the UK's number one web search engine, providing unbiased, fast, and relevant search results combined with a no-user-tracking privacy policy.
Mozbot | 2003 | Mozbot was previously called Reacteur.com. Its initial orientation, when it was created in 2003, was as a laboratory for ideas related to searching the Web for information. It allowed its creators to test many search functionalities over the months and gave them a better understanding of how search engines work.
Yahoo! Search | 1995 | Yahoo! Search is a web search engine owned by Yahoo. As of February 2015 it is the third largest search engine in the US by query volume, at 12.8%, after its competitors Google at 64.5% and Bing at 19.8%.
Yandex | 2010 | Yandex is a general-purpose Russian search engine that accounts for about 60% of the market share in its country and is now the 4th largest search engine worldwide.

Another problem that linguistic researchers have to face is the lack of flexibility of the WSEs' query languages. For regular Web users a simple query language is enough; however, linguistic researchers usually require operators that allow them to specify the size of the co-occurrence context (i.e., on a document or sentence basis), or to define character wildcards and even regular expressions. At a more abstract level, the possibility of searching according to part-of-speech categories would be desirable in many tasks.

An issue that may also hamper the suitability of WSEs as estimators of the Web information distribution is the lack of mathematical coherence of the hit counts with respect to the query syntax. Ideally, hit counts should be coherent with the logic operators used in the queries, such as AND, OR or NOT; however, this rarely holds in practice due to a number of causes, such as the fact that WSEs may access different caches even for consecutive queries, or because of the approximations implemented by the WSE's hit counting algorithm. Finally, at an operational level, research efforts may be limited by access restrictions regarding the number of consecutive queries allowed by WSEs (which are implemented to limit abuses or prevent attacks).

To evaluate the state of current WSEs regarding the above issues, we analysed the following aspects (see the results of these analyses in Tables 3 and 4):

• Non-literal searches: by examining the actual results provided by the WSEs for a set of queries, we evaluated whether the WSE considered different lexicalizations and synonyms of the queries. Queried terms were extracted from the linguistic benchmark we describe in Section 4.

• Flexibility of the query language: even though all WSEs implement Boolean search operators (i.e., AND, OR and NOT), we also checked whether they allow defining the size of the co-occurrence contexts for term pairs by means of proximity search operators (e.g., NEAR/PHRASE operators), and whether they support character wildcards (e.g., * or ?) or regular expressions. Finally, we checked whether they provide the possibility to define part-of-speech searches (e.g., nouns, adjectives, etc.).

• Hit count coherence: this aspect refers to the mathematical coherence of hit counts with respect to the query syntax. Specifically, we empirically tested whether the order of the terms in a query involving the AND

operator significantly influences the hit counts; that is, whether hits(a AND b) is equal to or, at least, similar to hits(b AND a). For a WSE to be considered coherent, we set a maximum divergence of 25% between the hit counts of both queries (a minimal sketch of this test is given after this list). As above, the term pairs queried in this test were extracted from the linguistic benchmark we detail in Section 4.

• Access restrictions: first, we checked whether the WSE offers an API to perform searches. Then, we checked the maximum number of queries allowed per day and IP, either directly or through the API.
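To make the coherence test concrete, the following is a minimal Python sketch of the check just described, assuming a get_hit_count callable that wraps a query to the WSE under test; the function names and the shape of the benchmark pairs are illustrative assumptions of ours, while the 25% divergence threshold comes from the criterion above.

```python
def is_hit_count_coherent(get_hit_count, term_pairs, max_divergence=0.25):
    """Check whether hits(a AND b) stays within a given relative
    divergence of hits(b AND a) for every benchmark term pair.

    get_hit_count is assumed to be a callable that sends a query string
    to the WSE under test and returns its hit count as an integer.
    """
    for a, b in term_pairs:
        hits_ab = get_hit_count(f'"{a}" AND "{b}"')
        hits_ba = get_hit_count(f'"{b}" AND "{a}"')
        reference = max(hits_ab, hits_ba)
        if reference == 0:
            continue  # no evidence either way for this pair
        divergence = abs(hits_ab - hits_ba) / reference
        if divergence > max_divergence:
            return False  # term order changes the hit count too much
    return True
```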

Table 3. WSEs' support for non-literal searches and flexibility of the query language.

WSE | Supports lexicalizations | Part-of-speech searches | Proximity search operators | Wildcards | Regular expressions
AOL Search | Yes | Yes (nouns) | PHRASE | No | No
Bing | Yes | Yes (nouns) | NEAR | : | No
Ecosia | Yes | No | No | No | No
Entireweb | Yes | No | No | No | No
Gibiru | Yes | No | No | No | No
Gigablast | Yes | No | No | No | No
Google | Yes | Yes (nouns) | NEAR | @, $, #, *, [..] | Yes
Hotbot | Yes | No | NEAR | No | No
Lycos | Yes | No | No | No | No
Mojeek | Yes | No | NEAR | No | No
Mozbot | Yes | No | No | No | No
Yahoo! Search | Yes | No | No | #, ! | No
Yandex | Yes | No | No | !, * | No

Table 4. WSEs' access restrictions and mathematical coherence of their hit counts.

WSE | Search API | Maximum queries per day/IP | Hit count mathematical coherence
AOL Search | Yes | Unlimited | Yes
Bing | Yes | 170 (free API) | No
Ecosia | No | Unlimited | No
Entireweb | Yes | 1,000 | Yes
Gibiru | No | 1,000 | Yes
Gigablast | Yes | 1,024 | No
Google | Yes | 100 | Yes
Hotbot | No | Unknown | No
Lycos | Yes | Unknown | No
Mojeek | Yes | 1,000 (payment version) | Yes (exact)
Mozbot | No | 1,000 | Yes
Yahoo! Search | Yes | 100,000 | No
Yandex | Yes | 10,000 | Yes

4. Application-oriented quantitative evaluation

Even though the former qualitative analysis provides insights on the potential suitability of WSEs in linguistic research, it does not answer a crucial question: are the hit counts accurate estimators of the Web/social information distribution? In this section, we bring some light to this question with an application-oriented evaluation

that uses WSEs' hit counts as input in one of the core tasks of computational linguistics that closely depends on the (social) information distribution: the estimation of the semantic similarity/distance between linguistic entities.

Semantic similarity/distance quantifies the resemblance between the meanings of two linguistic terms or concepts, a dimension that can be measured in several ways and by exploiting different information/knowledge sources (e.g., raw or structured textual corpora, ontologies, thesauri, etc.) [32]. Since semantic distance is the linguistic equivalent of arithmetic distance for numbers, its calculation is crucial in many algorithms dealing with linguistic data (e.g., classification and clustering of documents, semantic disambiguation, etc.), whose outcomes thus closely depend on the accuracy of the semantic distance calculation [32].

In the context of our study, we adopt the perspective of distributional semantics, that is, the calculation of the semantic similarity between linguistic entities from their distributional characteristics in large samples of language data (in our case, the Web). The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings; or, in other words, items that tend to co-occur in a discourse are likely to be semantically related or similar [33]. If we translate this notion to the context of WSEs, we can use their hit counts to estimate i) the frequency of textual terms, by querying them individually to the WSE, and ii) the co-occurrence of pairs of terms within the same context (i.e., a web page), by concatenating them within the same query with the AND operator. Then, by dividing both values by the total number of web pages indexed by the WSE, we obtain both the marginal and joint probabilities of a pair of linguistic entities at a Web scale and, from these, estimate their semantic resemblance according to a similarity coefficient.

In the following, we detail some of the similarity coefficients that have been adapted to measure the semantic resemblance of linguistic terms from the hit counts provided by WSEs (and that we use here as a means to evaluate the practical accuracy of hit counts); a code sketch of these measures is given after the list of equations:

• Pointwise mutual information (PMI): measures how much the probability of co-occurrence of two events (p(a, b)) differs from what we would expect it to be on the basis of the probabilities of the individual events (p(a) and p(b)) and the assumption of independence between them. Probabilities were approximated by WSE hit counts in the seminal work by Turney, which focused on discovering synonyms [8]:

\[
PMI(a,b) = \log_{10}\frac{p(a,b)}{p(a)\,p(b)} \approx \log_{10}\frac{\dfrac{hits("a" \text{ AND } "b")}{total\_webs}}{\dfrac{hits("a")}{total\_webs} \cdot \dfrac{hits("b")}{total\_webs}} \qquad (1)
\]

• Normalized PMI (NPMI): is the normalized version of the previous function (in the [-1,+1] range):

\[
NPMI(a,b) = \frac{PMI(a,b)}{-\log_{10} p(a,b)} \approx \frac{\log_{10}\dfrac{\dfrac{hits("a" \text{ AND } "b")}{total\_webs}}{\dfrac{hits("a")}{total\_webs} \cdot \dfrac{hits("b")}{total\_webs}}}{-\log_{10}\dfrac{hits("a" \text{ AND } "b")}{total\_webs}} \qquad (2)
\]

• Symmetric conditional probability (SCP): is the product of the conditional probabilities of the two events (a and b). In [34] this measure was adapted to use Web hit counts to discover bigrams, as follows:

\[
SCP(a,b) = \frac{p(a,b)^2}{p(a)\,p(b)} \approx \frac{\left(\dfrac{hits("a" \text{ AND } "b")}{total\_webs}\right)^2}{\dfrac{hits("a")}{total\_webs} \cdot \dfrac{hits("b")}{total\_webs}} \qquad (3)
\]

• Normalized Google Distance (NGD): is a measure designed ad-hoc to estimate the semantic distance between pairs of terms from the hit count provided by the Google search engine [2]:

\[
NGD(a,b) = \frac{\max\big(\log_{10} hits("a"), \log_{10} hits("b")\big) - \log_{10} hits("a" \text{ AND } "b")}{\log_{10} total\_webs - \min\big(\log_{10} hits("a"), \log_{10} hits("b")\big)} \qquad (4)
\]
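For readers who want to reproduce these computations, the following is a minimal Python sketch of equations (1)-(4) computed from raw hit counts; the hit counts and the total_webs constant are assumed to have been obtained from the WSE beforehand, and the small smoothing constant eps is our own assumption, added only to avoid taking the logarithm of zero.

```python
import math


def web_measures(hits_a, hits_b, hits_ab, total_webs, eps=1e-12):
    """Compute PMI, NPMI, SCP and NGD (eqs. (1)-(4)) from WSE hit counts.

    eps is a smoothing constant (an assumption of this sketch) that
    prevents log10(0) when a term or a term pair returns zero hits.
    """
    # Marginal and joint probabilities estimated at Web scale
    p_a = max(hits_a / total_webs, eps)
    p_b = max(hits_b / total_webs, eps)
    p_ab = max(hits_ab / total_webs, eps)

    pmi = math.log10(p_ab / (p_a * p_b))          # eq. (1)
    npmi = pmi / -math.log10(p_ab)                # eq. (2), assumes p_ab < 1
    scp = (p_ab ** 2) / (p_a * p_b)               # eq. (3)
    ngd = ((max(math.log10(hits_a + eps), math.log10(hits_b + eps))
            - math.log10(hits_ab + eps))
           / (math.log10(total_webs)
              - min(math.log10(hits_a + eps), math.log10(hits_b + eps))))  # eq. (4)
    return {"PMI": pmi, "NPMI": npmi, "SCP": scp, "NGD": ngd}
```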

The usual way to evaluate the accuracy of semantic similarity/distance measures such as (1)-(4) consists in comparing their assessments with the similarity ratings provided by human experts on a set of term pairs. Specifically, by measuring the correlation between the similarity/distance values resulting from computerized measures and those provided by human experts, we can evaluate how well the former mimic human judgements on semantics. The most well-known semantic similarity benchmark was proposed by Rubenstein and Goodenough [35] and consists of 65 English noun pairs with averaged similarity ratings (on a 0-4 scale) provided by 51 human subjects.

We followed this procedure to evaluate the practical accuracy of the hit counts of WSEs as estimators of the Web's information distribution. For each WSE, we measured the similarity/distance between the term pairs in the Rubenstein and Goodenough benchmark by plugging the WSE's hit counts into the above-described functions ((1) to (4)). The similarity/distance ratings obtained by each function for the set of term pairs were then compared with the human similarity ratings for the same pairs in the Rubenstein and Goodenough benchmark according to the following metrics:

• Pearson correlation (r): measures the linear statistical dependence between two variables in the range [−1, +1], where 1 indicates that the variables are totally dependent, 0 means independence and −1 means inverse dependence. We computed the Pearson correlation between the hit count-based similarity/distance ratings obtained for each function and WSE and the human ratings in the benchmark; with this we quantify to which degree the hit counts provided by each WSE are suitable estimators of term probabilities.

• p-value of the correlation: measures the probability of observing a correlation at least as extreme as the one obtained if there were actually no association between the two variables; a p-value below 0.05 is conventionally considered a proof of statistical significance. We used it to test the significance of the Pearson correlation values and, more specifically, to evaluate whether the correlation differences observed for the different WSEs are significant enough to draw conclusions about their behaviour. A minimal computation sketch for both metrics is given below.
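As an illustration of how these two metrics can be computed, the sketch below correlates hit-count-based scores with the human benchmark ratings using SciPy; the function and variable names are placeholders of ours, and note that scipy.stats.pearsonr returns a two-sided p-value, which may differ from the exact convention used for the values reported in Table 5.

```python
from scipy.stats import pearsonr


def evaluate_against_benchmark(human_ratings, machine_scores):
    """Compare similarity/distance scores against human judgements.

    human_ratings: averaged 0-4 similarity ratings from the Rubenstein
    and Goodenough benchmark, one per term pair.
    machine_scores: scores produced for the same pairs, in the same
    order, by one of the measures (1)-(4) using a given WSE's hit counts.
    Returns the Pearson correlation r and its (two-sided) p-value.
    """
    r, p_value = pearsonr(human_ratings, machine_scores)
    return r, p_value

# Note: distance measures such as NGD are expected to yield a negative r,
# since larger distances correspond to lower human similarity ratings.
```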

As a reference for evaluating correlation values, the intra-rating Pearson correlation between the repetitions of the Rubenstein and Goodenough experiment was r=0.85. Since this value reflects the discrepancy between human ratings, it constitutes an upper correlation bound for computerized approaches.

To retrieve the hit counts needed to measure similarities/distances, we used the Selenium Python library (http://docs.seleniumhq.org/) to simulate query requests to the WSEs from different Web browsing sessions. In this way we avoid the access limitations that some WSEs impose regarding the number of consecutive queries allowed per IP. The total_webs constant for the different WSEs used in functions (1) to (4) was defined either according to the WorldWideWebSize.com source (for the WSEs it covers) or by querying the general-purpose cross-language term "a" and multiplying the resulting hit count by 1.5 (for safety). Table 5 depicts the Pearson correlations (and p-values) for the WSEs considered in our study with respect to the Rubenstein and Goodenough benchmark.
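The snippet below sketches, under stated assumptions, how such hit counts could be scraped with Selenium; the result-page URL template and the regular expression used to locate the hit count are WSE-specific and purely hypothetical here, as is the helper's name. Opening a fresh browser per query mirrors the strategy of simulating different browsing sessions described above.

```python
import re
import time
from urllib.parse import quote_plus

from selenium import webdriver


def get_hit_count(query, search_url_template, hit_count_pattern,
                  pause_seconds=2.0):
    """Fetch a WSE result page for `query` and scrape its hit count.

    search_url_template and hit_count_pattern are WSE-specific and are
    assumptions of this sketch, e.g. "https://example-wse.test/search?q={}"
    and a regex such as r"About ([0-9.,]+) results".
    """
    driver = webdriver.Firefox()  # a fresh browsing session per query
    try:
        driver.get(search_url_template.format(quote_plus(query)))
        time.sleep(pause_seconds)  # give the result page time to render
        match = re.search(hit_count_pattern, driver.page_source)
        if match is None:
            return 0
        # Strip thousands separators before converting to an integer
        return int(re.sub(r"[^0-9]", "", match.group(1)))
    finally:
        driver.quit()
```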

5. Discussion

Several conclusions can be extracted from the results reported in Sections 3 and 4. First, from the application-oriented evaluation in Table 5 it is surprising to see the poor results provided by some well-known WSEs; specifically, the widely used Bing and Yahoo! Search engines provided near-random assessments for most measures: correlation values were very close to 0 and the p-values were much higher than 0.05. This matches the lack of mathematical coherence of hit counts detected during the qualitative evaluation (last column of Table 4). In this respect, Bing seems to have the worst behaviour and, because Bing's hit counts (formerly Live Search) have

been used in the past with reasonable results [13, 16, 31], this seems to indicate that the mathematical coherence of the hit counts of Microsoft's search solution has degraded recently. Only Google provided a good (albeit not great) correlation, with results that are statistically significant across all the measures; moreover, from the qualitative evaluation in Section 3, we can see that Google is the WSE that offers the greatest query flexibility, including non-exact searches (only nouns are supported in part-of-speech searches), proximity operators and regular expressions.

Table 5. Pearson correlation (r) and p-value with respect to the human similarity ratings in the Rubenstein and Goodenough benchmark for the PMI, NPMI, SCP and NGD similarity/distance measures when using the hit count of the WSEs considered in this study to estimate term (co-)occurrence probabilities.

WSE name | PMI r | PMI p-value | NPMI r | NPMI p-value | SCP r | SCP p-value | NGD r | NGD p-value
AOL Search | 0.344 | 0.002 | 0.372 | 0.001 | 0.165 | 0.094 | -0.345 | 0.002
Bing | 0.064 | 0.306 | 0.097 | 0.220 | -0.004 | 0.488 | -0.067 | 0.299
Ecosia | 0.103 | 0.207 | 0.116 | 0.179 | 0.011 | 0.464 | -0.086 | 0.249
Entireweb | 0.251 | 0.022 | 0.318 | 0.005 | 0.164 | 0.096 | -0.252 | 0.021
Gibiru | 0.553 | 8.55E-07 | 0.605 | 4.29E-08 | 0.336 | 0.003 | -0.555 | 7.40E-07
Gigablast | 0.098 | 0.219 | 0.072 | 0.285 | 0.159 | 0.103 | -0.092 | 0.232
Google | 0.444 | 1.04E-04 | 0.467 | 4.34E-05 | 0.316 | 0.005 | -0.456 | 6.64E-05
Hotbot | 0.259 | 0.019 | 0.324 | 0.004 | 0.164 | 0.096 | -0.260 | 0.018
Lycos | 0.105 | 0.202 | 0.135 | 0.142 | 0.017 | 0.447 | -0.097 | 0.222
Mojeek | 0.591 | 1.03E-07 | 0.660 | 1.01E-09 | 0.334 | 0.003 | -0.612 | 2.67E-08
Mozbot | 0.348 | 0.002 | 0.199 | 0.056 | -0.092 | 0.232 | -0.348 | 0.002
Yahoo! Search | 0.189 | 0.066 | -0.160 | 0.101 | 0.160 | 0.102 | -0.184 | 0.072
Yandex | -0.268 | 0.015 | -0.248 | 0.023 | -0.054 | 0.335 | 0.274 | 0.014

Also from Table 5, only three WSEs provided statistically significant results in all cases (Mojeek, Gibiru and Google), with Mojeek and Gibiru being more accurate than Google. Mojeek was especially interesting because it not only provided the best correlation in our application-oriented evaluation, but it was also the only WSE providing hit counts with perfect mathematical coherence when inverting the order of term pairs (Table 4). The better results provided by Mojeek and Gibiru with respect to Google could be explained by the fact that the former are not driven by commercial interests and, contrary to most WSEs, they provide uncensored and unbiased searches. If the goal is analysing the Web's information distribution, one can imagine that the lack of bias in the results has a positive effect in making the hit counts less arbitrary, more robust and more representative of the true information distribution in society. Moreover, Mojeek and Gibiru allow a significantly larger number of queries than Google, even though Mojeek's API is not free. Their only drawback is the fact that their query languages are restricted to the very basic Boolean operators, which is something to be considered in more complex linguistic analyses. In a second-tier category, AOL Search, Entireweb and Hotbot provided "usable" and statistically significant results for all measures except SCP, albeit lower than those of the three WSEs discussed above. Of these, AOL Search may be useful in some non-critical tasks requiring a lot of queries, because it does not impose access limitations. The remaining WSEs provided results too arbitrary/near random to be used in research beyond the anecdotal.

Comparing the semantic similarity/distance measures in Table 5, we can see that NPMI tends to provide the most accurate results, closely followed by PMI and NGD and, at a long distance, by SCP. The normalization introduced in the NPMI measure seems to make it less dependent on the size of the Web corpus; moreover, the logarithmic quantification of NPMI, PMI and NGD seems to better capture the association between the distributions of linguistic entities and their semantic relationships which, as stated in [36], is not linearly dependent. Looking at the raw correlation values, it is also interesting to see the high correlation obtained by Gibiru and Mojeek for the NPMI measure (both above 0.6). Considering the human correlation upper bound for the benchmark we used (0.85), these results can be considered a reasonably good approximation of human judgements on semantics. Moreover, we should also consider the bare-bones nature of the measures and queries we employed. The literature is full of more elaborate distributional measures [33] that, by relying on some degree of supervision [37] or by employing more complex queries to minimize language ambiguity [8, 16], are able to provide more accurate results, nearer to the upper bound. We deliberately left these more sophisticated measures outside our study to minimize the number of variables to consider (e.g., tuning parameters, degree of supervision, supported/unsupported query operators) and to make the results dependent only on the performance of the WSEs' hit counts.

6. Conclusion and future work

Many researchers use the Web as a linguistic data source for knowledge discovery tasks and, specifically, commercial Web search engines as the low-entry-cost way to access Web data. In such tasks, the suitability of the WSE's hit count as a frequency estimator is crucial, because the accuracy of the outcomes closely depends on the "quality" of the hit counts. Even though the choice of a particular WSE has usually been relegated by researchers, in this study we have shown that there are very significant differences among WSEs and that the most well-known (and widely used) WSEs are not the best suited for linguistic research (Bing and Yahoo! Search provided particularly poor results and Google came third in our application-oriented evaluation). Our evaluation was twofold: first, we analysed the potential suitability of WSEs' hit counts as general-purpose estimators of the probability of (co-)occurrence of linguistic terms; then, we evaluated their actual performance in one of the core tasks of computational linguistics: the estimation of the semantic similarity between terms.

In comparison with other similar studies [29-31], we believe our work provides a number of contributions that should be of interest for linguistic researchers. On the one hand, we offer an up-to-date and much broader comparison of the currently available WSEs. On the other hand, we evaluate the practical accuracy of WSE hit counts in a core task of linguistic research. This enabled us to identify new promising WSEs (Mojeek and Gibiru), which provided hit counts more accurate than those of search engines widely used in research, such as Google, Yahoo or Bing.

As future work, we plan to widen our analysis in several ways. First, as done in some other studies [30, 31], we plan to test the consistency of the hit counts through time (both in general and in application-specific scenarios), because good consistency is needed to provide reliable research outcomes and to facilitate reproducibility. Second, we plan to test WSEs with semantic similarity benchmarks including rarer or domain-specific words; this would evaluate not only the accuracy of the hit counts, but also the indexation recall of WSEs and the statistical significance of the frequencies they provide. Finally, we will also consider additional mechanisms or strategies that can be employed to obtain more robust assessments or to alleviate the limitations of WSEs regarding language ambiguity; for example, by using proximity search operators (NEAR/PHRASE) instead of AND to force a stronger semantic relationship, or by concatenating additional terms to the queries in order to disambiguate polysemic terms [16].

Acknowledgements

This work was partly supported by the European Commission under the H2020 project “CLARUS”, by the Spanish Government through project TIN2014-57364-C2-R “SmartGlacis” and by the Government of Catalonia under grant 2014 SGR 537. The opinions in this paper are those of the authors and do not necessarily reflect the views of UNESCO.

References

1. Tesauro G, Gondek DC, Lenchner J, Fan J, Prager JM. Analysis of Watson's strategies for playing Jeopardy! Journal of Artificial Intelligence Research 2013:47:205-251.
2. Cilibrasi RL, Vitányi PMB. The Google Similarity Distance. IEEE Transactions on Knowledge and Data Engineering 2006:19(3):370-383.
3. Chow R, Golle P, Staddon J. Detecting Privacy Leaks Using Corpus-based Association Rules. In: 14th Conference on Knowledge Discovery and Data Mining. Las Vegas, NV: ACM; 2008. p. 893-901.
4. Sánchez D, Moreno A. Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering 2008:63(3):600-623.
5. Sánchez D, Batet M. C-sanitized: A privacy model for document redaction and sanitization. Journal of the Association for Information Science and Technology 2016:67(1):148-163.

6. Kilgarriff A. Googleology is Bad Science. Computational Linguistics 2007:33(1):147-151.
7. Grefenstette G. The WWW as a resource for example-based MT tasks. In: ASLIB Conference on Translating and the Computer 21. London, U.K.; 1999.
8. Turney PD. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In: 12th European Conference on Machine Learning, ECML 2001. Freiburg, Germany: Springer-Verlag; 2001. p. 491-502.
9. Keller F, Lapata M. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics 2003:29(3):459-484.
10. Nakov P, Hearst M. Search engine statistics beyond the n-gram: Application to noun compound bracketing. In: Ninth Conference on Computational Natural Language Learning. Ann Arbor, Michigan, US; 2005. p. 17-24.
11. Sánchez D, Moreno A. Pattern-based automatic taxonomy learning from the Web. AI Communications 2008:21(1):27-48.
12. Sánchez D. A methodology to learn ontological attributes from the Web. Data & Knowledge Engineering 2010:69(6):573-597.
13. Sánchez D, Moreno A, Vasto-Terrientes LD. Learning relation axioms from text: An automatic Web-based approach. Expert Systems with Applications 2012:39(5):5792-5805.
14. Etzioni O, Cafarella M, Downey D, Popescu A, Shaked T, Soderland S, Weld D, Yates A. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence 2005:165:91-134.
15. Sánchez D, Isern D. Automatic extraction of acronym definitions from the Web. Applied Intelligence 2011:34(2):311-327.
16. Sánchez D, Batet M, Valls A, Gibert K. Ontology-driven web-based semantic similarity. Journal of Intelligent Information Systems 2010:35(3):383-413.
17. Bollegala D, Matsuo Y, Ishizuka M. Measuring Semantic Similarity between Words Using Web Search Engines. In: 16th International Conference on World Wide Web, WWW 2007. Banff, Alberta, Canada: ACM Press; 2007. p. 757-766.
18. Sánchez D, Castellà-Roca J, Viejo A. Knowledge-based scheme to create privacy-preserving but semantically-related queries for web search engines. Information Sciences 2013:218:17-30.
19. Viejo A, Sánchez D, Castellà-Roca J. Preventing automatic user profiling in Web 2.0 applications. Knowledge-Based Systems 2012:36:191-205.
20. Sánchez D, Batet M, Viejo A. Utility-preserving sanitization of semantically correlated terms in textual documents. Information Sciences 2014:279:77-93.
21. Chapelle O, Chang Y. Yahoo! Learning to Rank Challenge Overview. In: Yahoo! Learning to Rank Challenge at ICML 2010. Haifa, Israel; 2011. p. 1-24.
22. Brill E. Processing Natural Language without Natural Language Processing. In: 4th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2003. Mexico City, Mexico: Springer Berlin / Heidelberg; 2003. p. 360-369.
23. Lewandowski D. Evaluating the retrieval effectiveness of Web search engines using a representative query sample. Journal of the Association for Information Science and Technology 2015:66(9):1763-1775.
24. Macfarlane A. Evaluation of web search for the information practitioner. Aslib Proceedings: New Information Perspectives 2007:59(4-5):352-366.
25. Deka SK, Lahkar N. Performance evaluation and comparison of the five most used search engines in retrieving web resources. Online Information Review 2010:34(5):757-771.
26. Bilal D. Ranking, relevance judgment, and precision of information retrieval on children's queries: Evaluation of Google, Yahoo!, Bing, Yahoo! Kids, and ask Kids. Journal of the American Society for Information Science and Technology 2012:63(9):1879-1896.
27. Zhang J, Fei W. Search engines' responses to several search feature selections. The International Information & Library Review 2010:42(3):212-225.
28. Thelwall M. Quantitative comparisons of search engine results. Journal of the American Society for Information Science and Technology 2008:59(11):1702-1710.
29. Uyar A. Investigation of the accuracy of search engine hit counts. Journal of Information Science 2009:35(4):469-480.
30. Satoh K, Yamana H. Hit Count Reliability: How Much Can We Trust Hit Counts? In: 14th Asia-Pacific International Conference on Web Technologies and Applications. Springer; 2012. p. 751-758.
31. Funahashi T, Yamana H. Reliability Verification of Search Engines' Hit Counts: How to Select a Reliable Hit Count for a Query. In: Current Trends in Web Engineering. Springer; 2010. p. 114-125.
32. Batet M, Sánchez D. Review on Semantic Similarity. In: Encyclopedia of Information Science and Technology (3rd edition). IGI Global; 2014. p. 7575-7583.
33. Mohammad S, Hirst G. Distributional Measures of Semantic Distance: A Survey. http://arxiv.org/abs/1203.1858; 2006.
34. Downey D, Broadhead M, Etzioni O. Locating complex named entities in Web text. In: 20th International Joint Conference on Artificial Intelligence, IJCAI 2007. Hyderabad, India: AAAI; 2007. p. 2733-2739.
35. Rubenstein H, Goodenough J. Contextual correlates of synonymy. Communications of the ACM 1965:8(10):627-633.
36. Lemaire B, Denhière G. Effects of high-order co-occurrences on word semantic similarities. Current Psychology Letters - Behaviour, Brain and Cognition 2006:18(1):1.
37. Bollegala D, Matsuo Y, Ishizuka M. A Relational Model of Semantic Similarity between Words using Automatically Extracted Lexical Pattern Clusters from the Web. In: Conference on Empirical Methods in Natural Language Processing, EMNLP 2009. Singapore, Republic of Singapore: ACL and AFNLP; 2009. p. 803-812.