Towards an Open Web Index: Lessons from the Past
Michael Völske,† Janek Bevendorff,† Johannes Kiesel,† Benno Stein,† Bauhaus-Universität Weimar, Germany
Maik Fröbe,∗ Matthias Hagen,∗ Martin-Luther-Universität Halle-Wittenberg, Germany
Martin Potthast,‡ Leipzig University, Germany

∗ <first-name>.<last-name>@informatik.uni-halle.de
† <first-name>.<last-name>@uni-weimar.de
‡ [email protected]

Abstract

Recent efforts towards establishing an open and independent European web search infrastructure have the potential to contribute notably to more resilient, more equitable, and ultimately more effective information access in Europe and beyond. In this article, we recapitulate what we believe to be important goals and challenges that this effort should aspire to, and review how past web search endeavors have fared with respect to these criteria.

In a nutshell: For the past twenty years, no independent search engine has been able to establish itself as a fully viable alternative to the major players. The attempts so far underline the importance of both collaboration to close the scale gap and innovative new ideas for funding and bootstrapping a new contender. Future entrants to the web search market are well advised to take note of existing approaches to collecting relevance feedback data in a privacy-protecting manner.

INTRODUCTION

The Open Search Foundation strives for a more open, diverse, and transparent web search landscape that offers a variety of independent search engines. Today's web search market, where a single market leader has held onto a global market share of over 90% for more than a decade, is in many ways the opposite of this ideal. Specifically from a European perspective, various pundits advocate for a more independent domestic digital infrastructure [28, 29]. On occasion, such arguments can appear borderline chauvinistic in favoring domestic technology for its own sake, or even regarding it as somehow inherently superior. Nevertheless, we believe that there are good reasons to aspire to establish a new, viable alternative. Since a good portion of contemporary economic and social activity depends to some extent on information access facilitated by web search, search engine monoculture presents a risk to the societies and economies that depend on it. The recognition of this risk is nothing new, as evidenced by a long history of efforts directed toward establishing alternatives; however, none so far can be considered fully successful.

This article is first and foremost an appeal to heed the lessons that can be learned from the challenges that alternative web search engines have encountered in the past. In order to provide a framework and context for these efforts, we summarize in the following what we consider to be the primary goals and challenges that an open web search engine must address.

(1) Independence. In today's interconnected world, access to information can be considered almost as critical to day-to-day life as fundamental infrastructure such as water supply and hospitals. The flood of information being unmanageable without effective retrieval, web search indexes should be regarded as serving an essential public need, and like other essential infrastructure should be resilient to breakdown, and thus redundant [1]. From a global point of view, being as independent as possible from existing providers achieves this redundancy.

(2) Scale. Creating and maintaining a useful web index requires significant computational resources, as documents need to be crawled, stored, indexed, and re-crawled to maintain freshness [3]. For instance, Google reported an index size exceeding 100 PiB in 2017, spread across fifteen major data centers worldwide [12, 15].

(3) User data. Bootstrapping a successful search engine generally hinges on the chicken-and-egg problem of user data acquisition. While a web search frontend can certainly function without tracking users [2, 27], it proves difficult if not impossible to provide the best possible service without relevance feedback in a competitive environment where user tracking is the norm [7]. Any potential gains in search effectiveness through user data collection must be reconciled with the related, equally important, and yet somewhat contradictory goal of protecting users' privacy, which poses technical challenges of its own [4, 31].

(4) Market penetration. Closely related to the previous issue, even the best piece of technology is useless unless it is used. Key to the development of an open search infrastructure is a marketing strategy that overcomes the barriers to market entry posed by the virtual omnipresence of one contender. It is futile to think that an alternative search engine will reach wide adoption based only on being against the incumbent. A clear-cut unique selling proposition must be obtained and defended, or else end users will stick to what they are used to and what is most convenient.

(5) Funding. Commercial search providers tend to finance their operations through selling ads, or selling search engine result page (SERP) placement directly [21]. From the perspective of a new contender in open search infrastructure, this approach appears forlorn. Not only might it contradict aspirations toward independence, but the market situation is vastly different today compared to more than 20 years ago, when digital advertising first established itself. This makes finding a funding model compatible with open search a key challenge that must be overcome.

(6) Transparency. In their role as "gatekeepers to information," the major commercial online services are coming under increasing scrutiny, due in large part to the opaqueness of their operations [14]. Clearly, in order to be trusted, an open search engine cannot be a black box; at the same time, a highly transparent search engine brings its own challenges, such as being more easily exploitable by parties interested in manipulating its rankings [20].

In what follows, we examine a variety of past attempts at building alternative search engines with respect to how they address, or fail to address, the above goals and challenges. Following that, we outline a few ideas that we believe can help future endeavors fare better in these respects.

LESSONS FROM THE PAST

There have been many previous attempts at addressing the challenges facing alternative search providers. Table 1 surveys previous attempts at establishing general-purpose search engines, with a focus on recently active endeavors, and evaluates them with respect to the goals and challenges identified in the previous section. While at first glance it seems that the number of available options is quite large, closer inspection makes clear that many of them do not satisfy our criteria. The following sections examine how each of the challenges has been tackled in the past.

Attempts at Fully Independent Alternatives

For a search engine to be fully independent, it needs its own crawling infrastructure to feed its own index, which it serves to users with its own ranking algorithm. The price tag of crawling and indexing the whole web can be put at around one or two years of time and well over one billion dollars in cash [8], so building a search engine fully independent of existing competitors is a hard if not impossible challenge for any newcomer. Despite the sheer volume of niche search engines, hardly any of them can actually be considered independent. Besides Microsoft's Bing and the long-established regional search engines Baidu and Yandex, only few operate their own index infrastructure and their own ranking algorithms. The recent shutdown of the Cliqz search engine has shown once again how high the entry barriers to the search market are, especially for companies who want to develop and maintain their own technology stack. Cliqz was a search engine that was "private by default," with a custom index and a ranking based upon the "Human Web" [6], a sink for anonymized and unlinkable user click and web traffic statistics totally devoid of directly or indirectly user-identifiable data. […] has so far remained more of a technical curiosity than a practical and widely used search engine, however. Smaller contenders with independent search indexes include GigaBlast, Mojeek, and Exalead, which have all been active for more than fifteen years, but don't seem to match the search result quality of the major search engines [23].

Scaling a New Search Engine

Indexing the web requires a massive investment in infrastructure, and it is easy to get lost in naive assumptions about the amount of resources needed; on the other hand, we can assume that one does not need to outscale Google in order to provide a useful and competitive search engine. Overall, the size of the indexed web is estimated at around 60 billion* web pages, which easily exceeds the largest Common Crawl to date by an order of magnitude. Today, the Internet Archive stores around 60–70 PB of archival data; the Wayback Machine alone had indexed well over 20 PB of data as of 2018 [18]. This number serves as a good base estimate for a representative sample of the indexable web. To provide a live index with complete full-text search and timely updates, however, the storage requirements can easily multiply. By seeding the crawler with only the top-ranked domains, the size of the index can be reduced significantly at the cost of completeness. A simple full-text index of a 1.6–2.1-billion-document Common Crawl snapshot can be built at around 20–30 TB, with an additional 30 TB for holding the original cached HTML pages [2], about the size of Google's index back in 2004 [11]. Such an index contains no multimedia content, no user data, no knowledge graphs, and no recent updates. Adding overhead for redundant data storage at a factor of at least 1.5 or 2.0, additional space for storing fresh crawls, user logs, archival of old data, as well as space for processing and indexing new data, it is safe to assume that a minimum capacity of 50 PB has to be planned for at the low end, with an additional 25% on top to provide sufficient room for rebalancing data in the case of outages and as a general buffer before new hardware has to be acquired.