
Data Management Projects at Google

Michael Cafarella, Edward Chang, Andrew Fikes, Alon Halevy, Wilson Hsieh, Alberto Lerner, Jayant Madhavan, S. Muthukrishnan

1. INTRODUCTION

This article describes some of the ongoing research projects related to structured data management at Google today. The organization of Google encourages research scientists to work closely with engineering teams. As a result, the research projects tend to be motivated by real needs faced by Google's products and services, and solutions are put into production and tested rapidly. In addition, because of the sheer scale at which Google operates, the engineering challenges faced by Google's services often require research innovations.

In Google's early days, structured data management was mostly needed for storing and serving data related to ads. However, as the company grows into hosted applications and the analyses performed on its query streams and indexes get more sophisticated, structured data management is becoming a key infrastructure in all parts of the company.

What we describe below is a subset of ongoing projects, not a comprehensive list. Likewise, there are others who are involved in structured data management projects, or have contributed to the ones described here, some of whom are Roberto Bayardo, Omar Benjelloun, Vignesh Ganapathy, Yossi Matias, Rob Pike and Ramakrishnan Srikant.

Sections 2 and 3 describe projects whose goal is to enable search on collections of structured data that exist today on the web. Section 2 describes our efforts to crawl content that resides behind forms on the web, and Section 3 describes our initial work on enabling search on collections of HTML tables. Section 4 describes work on mining large collections of data and social graphs. Sections 5 and 6 describe recent progress on Bigtable, our main infrastructure for storing structured data.

2. CRAWLING THE DEEP WEB

Jayant Madhavan and Alon Halevy

The Deep Web refers to content hidden behind HTML forms. In order to get to a web page from the Deep Web, a user has to perform a form submission with valid input values in the form's fields. Since web crawlers primarily rely on hyperlinks to discover web pages, they are unable to reach pages on the Deep Web, which are consequently not indexed by search engines. The Deep Web has been acknowledged as a significant gap in the coverage of search engines, and various accounts have hypothesized that the Deep Web holds much more data than the currently searchable World Wide Web. Included in the Deep Web are a large number of high-quality sites, such as store locators and government sites. Hence, we would like to extend the coverage of the search engine to include web pages from the Deep Web.

There are two complementary approaches to offering access to Deep Web content. The first approach, essentially a data integration solution, is to create vertical search engines for specific domains (e.g., cars, books, real estate). In this approach we could create a mediator form for the domain at hand and semantic mappings between individual data sources and the mediator form. At web scale, this approach suffers from several drawbacks. First, the cost of building and maintaining the mediator forms and the mappings is high. Second, identifying the domain, and the forms within the domain, that are relevant to a keyword query is extremely challenging. Finally, data on the web is about everything and domain boundaries are not clearly definable, not to mention the many different languages – creating a mediated schema of everything would be an epic challenge.

The second approach, sometimes called the surfacing approach, pre-computes the most relevant form submissions for all interesting HTML forms. The URLs resulting from these submissions can then be indexed like any other HTML page. Importantly, this approach enables leveraging the existing search engine infrastructure, and hence the seamless inclusion of Deep Web pages into web search results; this leads us to prefer the surfacing approach. We note that our goal is to drive new traffic to Deep Web sites that until now were visited only if people knew about the form or if the form itself came up in a search result. Consequently, it is not crucial that we obtain all the possible form submissions from these sites, but just enough to drive more traffic. Furthermore, our pre-computed form submissions function as a seed for unlocking the site – once an initial set of pages is in the index, the crawling system will automatically leverage the internal structure of the site to discover other pages of interest.

We have developed a surfacing system at Google that has already enhanced the coverage of our search index to include web pages from over a million HTML forms. We are subsequently able to drive over a thousand queries per second from the Google.com search page to Deep Web content.

In deploying our solution, we had to overcome several challenges. First, a large number of forms have text box inputs and require valid input values to be submitted. Therefore, the system needs to choose a good set of values to submit in order to surface the most useful result pages. We use a combination of two approaches to address this challenge. For search boxes, which accept most keywords, we predict good candidate keywords by analyzing the content of already indexed pages of the website. For typed text boxes, which only accept a well-defined set of values, we attempt to match the type of the text box against a library of types that are extremely common across domains, e.g., zip codes in the US.

Second, HTML forms typically have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. Crawling too many URLs will drain the resources of a search engine web crawler while also posing an unreasonable load on web servers hosting the HTML forms. Interestingly, when the Cartesian product is very large, it is likely that a large number of the form submissions result in empty result sets that are useless from an indexing standpoint. For example, the search form on cars.com has 5 inputs and a Cartesian product will yield over 200 million URLs, even though cars.com has only 650,000 cars on sale. We have developed an algorithm that intelligently traverses the search space of possible form submissions to identify only the subset of input combinations that are likely to be useful to the search engine index. On average, we only generate a few hundred form submissions per form. Furthermore, we believe the number of form submissions we generate is proportional to the size of the database underlying the form site, rather than the number of inputs and input combinations in the form.
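The article does not spell out the traversal algorithm, but the flavor of the idea can be sketched as follows: test small "templates" that bind only a few inputs, and keep a template only if its submissions yield sufficiently distinct result pages. In this Python sketch the function names, the signature-based informativeness test, and all thresholds are our own illustrative assumptions, not the production criteria.

    import itertools
    from typing import Callable, Dict, List

    def select_submissions(
        candidates: Dict[str, List[str]],                   # input name -> candidate values
        fetch_signature: Callable[[Dict[str, str]], str],   # submission -> result-page fingerprint
        max_inputs_per_template: int = 2,
        min_distinct_fraction: float = 0.2,
    ) -> List[Dict[str, str]]:
        """Pick a small set of form submissions instead of the full Cartesian product."""
        selected: List[Dict[str, str]] = []
        names = sorted(candidates)
        for size in range(1, max_inputs_per_template + 1):
            for subset in itertools.combinations(names, size):
                # A template binds only the inputs in `subset`; all others keep their defaults.
                submissions = [dict(zip(subset, values))
                               for values in itertools.product(*(candidates[n] for n in subset))]
                signatures = {fetch_signature(s) for s in submissions}
                # Keep the template only if it produces enough distinct result pages;
                # templates whose submissions mostly return empty or identical pages are dropped.
                if len(signatures) >= min_distinct_fraction * len(submissions):
                    selected.extend(submissions)
        return selected

A caller would supply fetch_signature, for example a function that fetches the result page for a given submission and returns a content fingerprint; empty or identical result pages then collapse to a single signature and the template is rejected.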
Third, our solutions must scale and be domain independent. There are millions of potentially useful forms on the web. Given a particular form, it might be possible for a human expert to determine through laborious analysis the best possible submissions for that form, but such a solution would not scale. Our goal was to find a completely automated solution that can be applied to any web form in any language or domain. To date, our system has crawled over a million forms in over 50 languages and in hundreds of domains.

We note that we only index informational form sites. We take precautions to avoid any form that requires any personal information or is likely to have side effects. For example, we do not analyze forms that use the post method, have password or textarea inputs, or include keywords such as username, login, etc.

While our surfacing approach has generated considerable traffic, there remains a large number of forms that continue to present a significant challenge to automatic analysis. For example, many forms invoke JavaScript events in onselect and onsubmit tags that enable the execution of arbitrary JavaScript code, a stumbling block to automatic analysis. Further, many forms involve inter-related inputs, and accessing those sites involves correctly (and automatically) identifying the underlying dependencies. Addressing these and other such challenges efficiently on the scale of millions of forms is part of our continuing effort to make the contents of the Deep Web more accessible to search engine users. Finally, we also note that site maps are another mechanism that allows content providers to give lists of URLs in XML files to search engines, and thereby expose content from the Deep Web. All major search engines today support the site maps protocol described at www.sitemaps.org. The content provided by site maps tends to be complementary to the content that is automatically crawled using the techniques described above.

3. SEARCHING HTML TABLES

Michael Cafarella and Alon Halevy

The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Some of these tables contain relational-style data, with tuple-oriented rows and a schema that is implicit but often obvious to a human observer. Indeed, these HTML tables make up the largest corpus of relational databases that we are aware of (numbering more than 100M databases, with more than 2M unique schemas). The WebTables project is an effort to extract high-quality relations from the raw HTML tables, to make these databases queryable, and to use this unique dataset to build novel database applications.

The first WebTables task is to recover high-quality relations from a general web crawl. Relation recovery includes two steps: first, WebTables filters out raw HTML tables that do not carry relational data (such as those used for page layout). Second, for those tables that pass the relational filter, WebTables recovers schema metadata such as column types and labels. There is no way for an HTML author to reliably indicate whether a table is relational, or to formally declare relational metadata. Instead, WebTables must rely on a host of implicit hints. For example, tables that are used for page layout will often contain very long and text-heavy cells, whereas tables for relational data will usually contain shorter cells of roughly consistent length. Similarly, one way to test whether a table author has inserted a "header row" is to see if the first row is all strings, with different types in the remaining rows.
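These hints are easy to approximate in code. The sketch below gives toy versions of the two checks: a cell-length based layout filter and the "first row is all strings, later rows are typed" header test. The thresholds and the simple numeric-type test are our own illustrative choices; the actual WebTables filters are richer than this.

    import statistics
    from typing import List

    def is_layout_table(table: List[List[str]]) -> bool:
        """Rough filter: layout tables tend to have long, text-heavy cells of wildly
        varying length; relational tables have short cells of consistent length."""
        cells = [c for row in table for c in row if c.strip()]
        if len(cells) < 4:
            return True
        lengths = [len(c) for c in cells]
        return statistics.mean(lengths) > 100 or statistics.pstdev(lengths) > 80

    def _is_number(cell: str) -> bool:
        try:
            float(cell.replace(",", ""))
            return True
        except ValueError:
            return False

    def has_header_row(table: List[List[str]]) -> bool:
        """Heuristic from the text: the first row is a header if it is all strings
        while the same columns in the remaining rows hold a different (numeric) type."""
        if len(table) < 2 or not table[0]:
            return False
        first, rest = table[0], table[1:]
        if any(_is_number(cell) for cell in first):
            return False
        for col, _ in enumerate(first):
            vals = [row[col] for row in rest if col < len(row) and row[col].strip()]
            if vals and sum(_is_number(v) for v in vals) / len(vals) > 0.5:
                return True  # at least one column changes type below the first row
        return False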
The second WebTables challenge is to design a query tool that gives easy access to more than a hundred million unique databases with more than 2 million unique schemas. We have built a "structured data search engine" in which the user types a search-style text query, and the engine returns a relevance-ranked list of databases instead of a list of URLs. After the user has chosen a relevant database, she can apply more traditional structured query tools (such as selection, projection, etc.). Additionally, the engine can automatically apply certain structured operations without waiting for the user. For example, WebTables can examine the contents of a table and try to generate a visualization that is domain-appropriate and "interesting," displaying it next to the table in query search results.

Finally, WebTables uses the corpus of recovered databases to build a series of new applications. We have designed two so far. The first is schema autocomplete, in which a user enters one or more desirable data attributes (e.g., "name") and the autocompletor suggests the rest (e.g., "address", "city", "zip", "phone", etc.). The second is synonym finding, a tool that automatically computes which table attributes appear to be synonymous (e.g., "song" and "title", "telephone" and "tel-#"). This data can then be used to improve schema matching. Both tools are made possible by attribute-label co-occurrence statistics derived from the corpus of recovered databases.
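As a rough illustration of how such co-occurrence statistics can drive schema autocomplete, the sketch below counts how often attribute labels appear together across recovered schemas and suggests the labels that co-occur most strongly with a seed set; synonym finding can be built on the same counts. The scoring formula is a deliberately simple stand-in for whatever ranking the real system uses.

    from collections import Counter, defaultdict
    from itertools import combinations
    from typing import Dict, Iterable, List, Tuple

    def attribute_cooccurrence(schemas: Iterable[List[str]]) -> Tuple[Counter, Counter]:
        """Count single-attribute and pairwise co-occurrence frequencies over a corpus of schemas."""
        attr_counts: Counter = Counter()
        pair_counts: Counter = Counter()
        for schema in schemas:
            labels = sorted({a.strip().lower() for a in schema})
            attr_counts.update(labels)
            pair_counts.update(combinations(labels, 2))
        return attr_counts, pair_counts

    def autocomplete(seed: List[str], attr_counts: Counter, pair_counts: Counter, k: int = 5) -> List[str]:
        """Suggest the k attributes that co-occur most strongly with the seed attributes."""
        seed_set = {s.strip().lower() for s in seed}
        scores: Dict[str, float] = defaultdict(float)
        for (a, b), n in pair_counts.items():
            if a in seed_set and b not in seed_set:
                scores[b] += n / attr_counts[a]
            elif b in seed_set and a not in seed_set:
                scores[a] += n / attr_counts[b]
        return [a for a, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

For example, over a corpus in which "name" frequently appears alongside "address", "city", and "zip", autocomplete(["name"], ...) would surface those labels first.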

WebTables works today (available only internally), but we believe there are many future research questions. We can improve the performance of existing steps (such as relation recovery accuracy and database ranking quality), expand the input data beyond simple HTML tables (perhaps including HTML lists or Excel spreadsheets), and build new applications on the recovered data (such as data suggestion, a "vertical" analog to schema autocompletion). There is also a host of questions prompted by a "data-centric" view of the web: we are currently researching whether it is possible to automatically find joins between structured data recovered from different web pages. For example, to find where world leaders reside, we might join a table of countries and capital cities to a table of countries and their premiers.

The WebTables Project and the Deep-Web Crawl Project are parts of our larger research effort into dataspaces [6], and on data integration with uncertainty as a basis for building dataspace systems. Some of our earlier work in this area is described in [4, 5, 8].

4. LARGE-SCALE DATA MINING AND COMMUNITY PRODUCTS

Edward Chang

We now describe our work on developing scalable algorithms for mining large-scale Web data and social graphs. This work is led by Edward Chang, who heads Google Research in China. Building upon this scalable data mining infrastructure, the engineering team developed and launched two social-network products, and drastically reduced the page-rank spam rate in China (from 5% in 2006 to now under 1%).

The research work focused on parallelizing six mission-critical machine learning algorithms, including Support Vector Machines (SVMs), Singular Value Decomposition (SVD), Spectral Clustering, Association Mining, Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA), to take advantage of Google's massive, distributed storage and computing services. In particular, his team parallelized SVMs [1], and made the code publicly available through Apache open source.

SVMs are widely used for classification tasks due to their strong theoretical foundation and empirical successes. Unfortunately, SVMs suffer from scalability problems in memory use and computational time. We developed a parallel SVM algorithm (PSVM) to remedy these problems. PSVM reduces memory use by performing a row-based approximate matrix factorization, and by loading only essential data onto each of the parallel machines. PSVM reduces computation time by intelligently reordering computation sequences and by performing them on parallel machines. Furthermore, PSVM supports fault-tolerant computing to recover from computer-node failures.

In terms of computational complexity, let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory required by the Interior Point Method (IPM) from O(n²) to O(np/m), and improves computation time to O(np²/m). For instance, a task taking 7 days to run on a single machine takes PSVM two hours to complete on 200 machines.
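To make the scaling claim concrete, here is a back-of-the-envelope comparison under purely hypothetical sizes (the article gives no concrete n, p, or m for a specific workload); it only illustrates the gap between O(n²) and O(np/m) memory.

    # Hypothetical sizes, for illustration only (not figures from the article).
    n = 1_000_000    # training instances
    p = 1_000        # reduced dimension after row-based factorization (p << n)
    m = 200          # machines

    full_kernel_entries = n * n              # classic IPM working set: O(n^2)
    psvm_entries_per_machine = n * p // m    # PSVM working set per machine: O(np/m)

    print(f"{full_kernel_entries:,}")        # 1,000,000,000,000 entries
    print(f"{psvm_entries_per_machine:,}")   # 5,000,000 entries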
PSVM is currently used internally at Google for identifying spammy and objectionable Web sites. Since PSVM was made publicly available, the code has been widely downloaded.

Besides PSVM, parallel versions of SVD, PLSA, and LDA have also been made available internally at Google. These algorithms are useful for classification and collaborative filtering tasks. For classification, PLSA is employed to provide tags for user questions, short messages, and user posts. For collaborative filtering, PLSA and LDA are used to assist various recommendation features, e.g., friend/expert suggestion, forum recommendation, and ads matching. Together, these algorithms power two products, which we describe next.

The first product is Knowledge Search, which was first launched in Russia and then China [12], and is now being launched in several other countries. Knowledge Search allows users to post questions and then matches experts to answer those questions in a timely manner. The distinguishing feature of this product compared to competing products is that it offers online question classification, related-question recommendation, and topic-sensitive expert matching. All of these features are powered by the aforementioned machine-learning infrastructure.

The second product, Laiba, is a social-network product initially developed based on Orkut. We first localized the product, and then quickly expanded its features, adding photo sharing and user-interaction services. We launched Laiba in China in 2007 [10]. Similar to Knowledge Search, this product is supported by large-scale data mining algorithms for friend/community/content recommendation. The team is now further expanding Laiba to support the Google OpenSocial platform [11], which will enable third-party applications to plug into Laiba and other social networks.

5. BIGTABLE

Andrew Fikes and Wilson Hsieh

We have built a system called Bigtable [2] to store structured or semi-structured data at Google. (The Google File System [7] is used when a file-system interface is acceptable.) Bigtable can be viewed at a systems level as a distributed, non-relational database; at the algorithmic level as a highly distributed multi-level map; or at the implementation level as a variant of a distributed B-tree. We use Bigtable to store data for many different projects, such as web indexing, Google Earth, and Orkut.

Bigtable has been under active development since late 2003, and its first deployment in production was in mid-2005. Over the last few years, the deployment of Bigtable has grown steadily. As of January 2008, there are more than 600 Bigtable clusters at Google; the largest cluster has over 2000 machines. The largest cells store over 700 TB of data, and the busiest cells sustain 100K operations/second.

Besides our day-to-day "maintenance" work (improving the performance of the system, fixing bugs, training users, writing documentation, improving manageability, etc.), we are still adding new features to Bigtable. In addition, we continue to redesign parts of the system as our users run larger Bigtable clusters. Following is a brief description of some of the higher-level issues that we are working on:

• Coping with failure. Bigtable software runs on lots of machines: enough to almost guarantee that we will eventually run into buggy hardware (faulty memory is one of the more problematic issues). We are investigating better ways to deal with such problems without hurting performance dramatically.

• Sharing machines across users. Although Bigtable's interface supports multiple users, the implementation did not until recently do a good job of providing sufficient isolation between them. We are still improving Bigtable's ability to do resource management and isolation.

• Cross-data-center replication. Bigtable currently allows clients to set up lazy replication between their tables (which can be in different data centers). This replication system guarantees eventual consistency, which suffices for many of our clients. However, some clients (in particular, those that are building user-facing products) need stronger guarantees on the consistency and availability of their data. We are building in support for these stronger consistency guarantees, both on top of Bigtable and inside Bigtable.

• Attaching computation to data. To support long-running computations that need to access data in Bigtable, we have been adding APIs that allow clients to run code on the same machines as their data. Although the Map-Reduce framework [3] does provide some support for running computation near data, it does not provide any strong guarantees.

• More expressive queries. Bigtable does not support SQL; it currently supports the use of a Google-designed language called Sawzall [9] to describe server-side filtering of data. For various reasons, this support is awkward to use, and requires a fair amount of work to describe simple filters. We are in the process of implementing a small language that will support the most common kinds of filters that our clients need.

• Direct support for indexing. Many of our clients want to store indexed data in Bigtable. Currently, they have to manage the indices themselves. We are in the process of building support for indices directly into Bigtable.

Most of these features are being added directly to Bigtable, but some features are being built as client layers on top of Bigtable. The Megastore project, for example, is building more general support for transactions, consistent replication, and DAO. Although Bigtable is not a database, most of the features that we are adding are very familiar to the database community. That fact is unsurprising, given the usefulness of these features. What will be interesting is what the design and implementation of Bigtable is in 1-2 years, and what it tells us about building high-performance, widely distributed data-storage systems.

6. MINITABLES: SAMPLING BIGTABLE

Alberto Lerner and S. Muthukrishnan

As described above, Bigtable is a high-performance, distributed, row-storage system that is highly scalable, but it is not meant to provide relational query processing or sophisticated indexing. Therefore, some accesses to a Bigtable may involve large parallel scans. Although Google's infrastructure supports these scans relatively well, there are instances where it is desirable to work with a sample of the data in a Bigtable. This section discusses the challenges and opportunities in building such a sampling feature.

A row in a Bigtable is keyed by a unique string called a rowname, and each row has its data spread across a number of column families. A column family may comprise a variable number of actual columns. Since Bigtables are sparse structures, a row may or may not exist for a given query, depending on which columns that query requested. Data is maintained in lexicographical order, but different columns may or may not be stored apart. Because of these semantics and this storage scheme, skipping N rows is not feasible without actually reading them. Even finding the count of rows in a Bigtable at any point in time can be done only probabilistically. On the bright side, since Bigtable does not provide a relational query engine, we do not need to consider what are suitable sampling methods for various relational operators (like joins), or take into account how sampling errors compound with increasing levels of query composition.
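As a mental model for the paragraphs that follow, the sketch below is a toy, in-memory caricature of the row/column-family layout just described (it is not Bigtable's actual client API). It also shows why skipping N rows or counting rows requires walking the keys.

    from typing import Dict, Iterator, Optional, Tuple

    Row = Dict[str, Dict[str, bytes]]  # column family -> qualifier -> value

    class TinyTable:
        """Toy model of a sparse table keyed by rowname (illustrative only)."""

        def __init__(self) -> None:
            self._rows: Dict[str, Row] = {}  # rowname -> row

        def put(self, rowname: str, family: str, qualifier: str, value: bytes) -> None:
            self._rows.setdefault(rowname, {}).setdefault(family, {})[qualifier] = value

        def get(self, rowname: str, family: str, qualifier: str) -> Optional[bytes]:
            # The table is sparse: a row "exists" only for the columns it actually holds.
            return self._rows.get(rowname, {}).get(family, {}).get(qualifier)

        def scan(self, start: str = "", end: str = "\uffff") -> Iterator[Tuple[str, Row]]:
            # Rows come back in lexicographic rowname order; there is no way to
            # jump to "the Nth row" or to count rows without iterating.
            for rowname in sorted(self._rows):
                if start <= rowname < end:
                    yield rowname, self._rows[rowname]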
Uniform Random Sampling. Our sampling scheme extracts and presents a sample of a Bigtable's rows as if it were a Bigtable itself, which we call a Minitable. The rationale here is that code written to run against a Bigtable can run unchanged against a sample thereof.

Our sampling is based on a hash scheme. We pick a convenient hash function that maps the rowname space into a very large keyspace (e.g., an ax + b mod p function, where p is as large as 2^128). The rows falling into the first fp keys, where f is the relative sample size (a fraction), belong in the sample. Formally, we pick a hash function h : R → [0..p], and if h(r) ∈ [0, fp − 1], then we add r to the sample. It is easy to see that the expected size of the sample is f × 100% of the Bigtable rowcount, independent of the rowcount, and that the probability that a particular row r is in the sample is f, as desired. This hash-based sampling method supports maintenance of the sample with each Bigtable mutation (insert, update, or deletion).

The system need only forward the relevant mutations from the Bigtable to the Minitable. In all other respects, the latter behaves just like any other Bigtable: it can be backed up and even replicated. We are currently deploying Minitables in the repository of documents that the crawling system generates. Several Minitables, each with a different sample factor, allow that system to compute aggregates much faster and more often.
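The membership test above translates almost directly into code. In the sketch below, the linear hash constants, the SHA-1 fingerprinting of the rowname, and the dict-based Minitable stand-in are our own illustrative choices rather than details of the production system.

    import hashlib
    from typing import Any, Dict

    P = 2 ** 128             # size of the hash keyspace
    A = 0x9E3779B97F4A7C15   # arbitrary constants for the linear hash (illustrative)
    B = 0x7F4A7C159E3779B9

    def h(rowname: str) -> int:
        """Map a rowname into [0, P) with an a*x + b mod p style hash."""
        x = int.from_bytes(hashlib.sha1(rowname.encode("utf-8")).digest(), "big")
        return (A * x + B) % P

    def in_sample(rowname: str, f: float) -> bool:
        """A row belongs to the Minitable iff its hash lands in the first f fraction of the keyspace."""
        return h(rowname) < int(f * P)

    def forward_mutation(minitable: Dict[str, Any], rowname: str, row: Any, deleted: bool, f: float) -> None:
        """Maintain the sample incrementally: only mutations of sampled rows reach the Minitable."""
        if not in_sample(rowname, f):
            return
        if deleted:
            minitable.pop(rowname, None)
        else:
            minitable[rowname] = row

With f = 0.01, about 1% of rows pass the test, and the same rowname always hashes the same way, so the Minitable stays consistent with the base table as mutations arrive.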

Biased Sampling. Uniform random sampling is quite useful, but some scenarios require biased sampling methods. We are currently working on one such extension, which we call Mask Sampling. In this scheme, the decision to select a row for the sample is still based on its rowname, but now a user may specify a mask m over it. The mask, which can be a regular expression that matches portions of a rowname, is used to group rows together. Two rows belong to the same group if their masks result in the same string. Mask sampling guarantees that if a group is selected for the sample, that group will be adequately represented there, regardless of that group's relative size.

A typical application would be over a Bigtable that stores web pages keyed by their URLs. If one used uniform random sampling over it, one may lose information about domains with relatively few pages. With mask sampling, one can define how to extract a domain from a URL and determine that each domain has the same probability of appearing in the sample.

Specifically, our procedure should return a possibly non-uniform sample of ⟨k.m⟩, that is, rowname k projected only on the mask. There are at least two details that make the problem interesting. (a) The set of distinct ⟨k.m⟩'s may be large and may itself need to be sampled. Using our previous example, there may simply be too many domains to fit in the desired sample size. (b) Even though the rownames are unique, the set of ⟨k.m⟩'s is often not: for each ⟨k.m⟩ value, we have a set of rows from the Bigtable and we need to determine what to retain in the Minitable. Again, using the example URL table, we may need to sample within a chosen domain. Let us consider the set S(n) of rows that have ⟨k.m⟩ = n. Then, ideally, we would like to keep all rows r from S(n) if |S(n)| is small, to sub-sample with moderate probability if |S(n)| is larger, and more aggressively when |S(n)| is huge.

The hash-sampling procedure generalizes to the biased case as well. We have h1 : ⟨k.m⟩ → [0..p] and retain those that hash into the first f fraction of the range, as before. Then, within each ⟨k.m⟩ = n that is retained by h1, we apply h2 (dependent on n), this time on the entire rowname as opposed to just the mask. Here, we assume that we have a side table T : ⟨k.m⟩ = n → g_n, which is often programmed by an offline logic or is easy to maintain in a lazy manner in practice. (It can be indirectly obtained if a uniform Minitable is present.) We call this table the gtable because it contains a row for each group-by specified by the mask.

In practice, this sampling scheme may give us a biased Minitable from the URL Bigtable with a representative sample of domains. Each of them would carry enough rows to allow for the computation of approximate aggregations, for instance, even if the domains chosen had a large variance in terms of number of rows in the base Bigtable.
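To make the two-level scheme concrete, here is a small sketch. h1 decides, from the mask projection ⟨k.m⟩ alone, whether a whole group enters the sample; h2, salted by the group, then sub-samples rownames within a kept group using a per-group rate g_n looked up from the gtable. The regular-expression masking, the SHA-1-based hashes, and the defaulting of g_n to 1.0 are our own illustrative choices, not the production design.

    import hashlib
    import re
    from typing import Dict

    P = 2 ** 128

    def _hash(value: str, salt: str) -> int:
        # Independent hash functions are simulated by salting one cryptographic hash.
        return int.from_bytes(hashlib.sha1((salt + value).encode("utf-8")).digest(), "big") % P

    def mask_projection(rowname: str, mask: str) -> str:
        """<k.m>: the portion of the rowname matched by the mask (the group key)."""
        m = re.search(mask, rowname)
        return m.group(0) if m else ""

    def in_biased_sample(rowname: str, mask: str, f: float, gtable: Dict[str, float]) -> bool:
        group = mask_projection(rowname, mask)
        # Level 1 (h1): keep a fraction f of the groups, hashing the mask projection only.
        if _hash(group, "h1") >= int(f * P):
            return False
        # Level 2 (h2, dependent on the group): within a kept group, keep each row
        # with the group's rate g_n (near 1 for small groups, smaller for huge ones).
        g_n = gtable.get(group, 1.0)
        return _hash(rowname, "h2:" + group) < int(g_n * P)

For the URL example in the text, a mask such as r"^https?://[^/]+" groups rows by domain, so a rarely crawled domain that survives h1 keeps essentially all of its rows, while a huge domain is thinned out by its g_n.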

7. REFERENCES

[1] Chang, E. Y., Zhu, K., et al. Parallelizing Support Vector Machines on Distributed Computers. In Proc. of NIPS 2007. Open-source code available at http://code.google.com/p/psvm/.
[2] Chang, F., et al. Bigtable: A Distributed Storage System for Structured Data. In Proc. of the 7th OSDI (Dec. 2006), pp. 205–218.
[3] Dean, J., and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. In Proc. of the 6th OSDI (Dec. 2004), pp. 137–150.
[4] Dong, X., and Halevy, A. Indexing Dataspaces. In Proc. of SIGMOD 2007, pp. 43–54.
[5] Dong, X., Halevy, A., and Yu, C. Data Integration with Uncertainty. In Proc. of VLDB 2007, pp. 687–698.
[6] Franklin, M., Halevy, A., and Maier, D. From Databases to Dataspaces: A New Abstraction for Information Management. SIGMOD Record 34(4), pp. 27–33, 2005.
[7] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System. In Proc. of the 19th ACM SOSP (Dec. 2003), pp. 29–43.
[8] Madhavan, J., Cohen, S., Dong, X., Halevy, A., Jeffery, S., Ko, D., and Yu, C. Web-Scale Data Integration: You Can Only Afford to Pay As You Go. In Proc. of CIDR 2007, pp. 342–350.
[9] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming Journal 13, 4 (2005), pp. 227–298.
[10] Laiba. http://laiba.tianya.cn.
[11] Google OpenSocial. http://code.google.com/apis/opensocial.
[12] Knowledge Search. http://otvety.google.ru/otvety/ and http://wenda.tianya.cn.
