
Data Management Projects at Google

Michael Cafarella, Edward Chang, Andrew Fikes, Alon Halevy, Wilson Hsieh, Alberto Lerner, Jayant Madhavan, S. Muthukrishnan

1. INTRODUCTION

This article describes some of the ongoing research projects related to structured data management at Google today. The organization of Google encourages research scientists to work closely with engineering teams. As a result, the research projects tend to be motivated by real needs faced by Google's products and services, and solutions are put into production and tested rapidly. In addition, because of the sheer scale at which Google operates, the engineering challenges faced by Google's services often require research innovations.

In Google's early days, structured data management was mostly needed for storing and serving data related to ads. However, as the company grows into hosted applications and the analyses performed on its query streams and indexes get more sophisticated, structured data management is becoming a key infrastructure in all parts of the company.

What we describe below is a subset of ongoing projects, not a comprehensive list. Likewise, there are others who are involved in structured data management projects, or have contributed to the ones described here, some of whom are Roberto Bayardo, Omar Benjelloun, Vignesh Ganapathy, Yossi Matias, Rob Pike and Ramakrishnan Srikant.

Sections 2 and 3 describe projects whose goal is to enable search on collections of structured data that exist today on the web. Section 2 describes our efforts to crawl content that resides behind forms on the web, and Section 3 describes our initial work on enabling search on collections of HTML tables. Section 4 describes work on mining large collections of data and social graphs. Sections 5 and 6 describe recent progress on Bigtable, our main infrastructure for storing structured data.

2. CRAWLING THE DEEP WEB

Jayant Madhavan and Alon Halevy

The Deep Web refers to content hidden behind HTML forms. In order to get to a web page from the Deep Web, a user has to perform a form submission with valid input values in the form's fields. Since web crawlers primarily rely on hyperlinks to discover web pages, they are unable to reach pages on the Deep Web, which are consequently not indexed by search engines. The Deep Web has been acknowledged as a significant gap in the coverage of search engines, and various accounts have hypothesized that the Deep Web holds much more data than the currently searchable World Wide Web. Included in the Deep Web are a large number of high-quality sites, such as store locators and government sites. Hence, we would like to extend the coverage of the search engine to include web pages from the Deep Web.

There are two complementary approaches to offering access to Deep Web content. The first approach, essentially a data integration solution, is to create vertical search engines for specific domains (e.g., cars, books, real estate). In this approach we could create a mediator form for the domain at hand and semantic mappings between individual data sources and the mediator form. At web scale, this approach suffers from several drawbacks. First, the cost of building and maintaining the mediator forms and the mappings is high. Second, identifying the domain, and the forms within the domain, that are relevant to a keyword query is extremely challenging. Finally, data on the web is about everything and domain boundaries are not clearly definable, not to mention the many different languages – creating a mediated schema of everything would be an epic challenge.

The second approach, sometimes called the surfacing approach, pre-computes the most relevant form submissions for all interesting HTML forms. The URLs resulting from these submissions can then be indexed like any other HTML page. Importantly, this approach enables leveraging the existing search engine infrastructure, and hence the seamless inclusion of Deep Web pages into web search results; this leads us to prefer the surfacing approach. We note that our goal is to drive new traffic to Deep Web sites that until now were visited only if people knew about the form or if the form itself came up in a search result. Consequently, it is not crucial that we obtain all the possible form submissions from these sites, but just enough to drive more traffic. Furthermore, our pre-computed form submissions function as a seed for unlocking the site – once an initial set of pages is in the index, the crawling system will automatically leverage the internal structure of the site to discover other pages of interest.

We have developed a surfacing system at Google that has already enhanced the coverage of our search index to include web pages from over a million HTML forms. We are subsequently able to drive over a thousand queries per second from the Google.com search page to Deep Web content.

In deploying our solution, we had to overcome several challenges. First, a large number of forms have text box inputs and require valid input values to be submitted. Therefore, the system needs to choose a good set of values to submit in order to surface the most useful result pages. We use a combination of two approaches to address this challenge. For search boxes, which accept most keywords, we predict good candidate keywords by analyzing the content of already indexed pages of the website. For typed text boxes, which only accept a well-defined set of values, we attempt to match the type of the text box against a library of types that are extremely common across domains, e.g., zip codes in the US.

Second, HTML forms typically have more than one input, and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. Crawling too many URLs will drain the resources of a search engine web crawler while also posing an unreasonable load on web servers hosting the HTML forms. Interestingly, when the Cartesian product is very large, it is likely that a large number of the form submissions result in empty result sets that are useless from an indexing standpoint. For example, the search form on cars.com has 5 inputs and a Cartesian product will yield over 200 million URLs, even though cars.com has only 650,000 cars on sale. We have developed an algorithm that intelligently traverses the search space of possible form submissions to identify only the subset of input combinations that are likely to be useful to the search engine index. On average, we only generate a few hundred form submissions per form. Furthermore, we believe the number of form submissions we generate is proportional to the size of the database underlying the form site, rather than the number of inputs and input combinations in the form.
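The article does not spell out the traversal algorithm, but the flavor of the idea can be sketched as follows: test small "templates" that bind only a few inputs, and keep a template only if its submissions yield sufficiently distinct result pages. In this Python sketch the function names, the signature-based informativeness test, and all thresholds are our own illustrative assumptions, not the production criteria.

    import itertools
    from typing import Callable, Dict, List

    def select_submissions(
        candidates: Dict[str, List[str]],                   # input name -> candidate values
        fetch_signature: Callable[[Dict[str, str]], str],   # submission -> result-page fingerprint
        max_inputs_per_template: int = 2,
        min_distinct_fraction: float = 0.2,
    ) -> List[Dict[str, str]]:
        """Pick a small set of form submissions instead of the full Cartesian product."""
        selected: List[Dict[str, str]] = []
        names = sorted(candidates)
        for size in range(1, max_inputs_per_template + 1):
            for subset in itertools.combinations(names, size):
                # A template binds only the inputs in `subset`; all others keep their defaults.
                submissions = [dict(zip(subset, values))
                               for values in itertools.product(*(candidates[n] for n in subset))]
                signatures = {fetch_signature(s) for s in submissions}
                # Keep the template only if it produces enough distinct result pages;
                # templates whose submissions mostly return empty or identical pages are dropped.
                if len(signatures) >= min_distinct_fraction * len(submissions):
                    selected.extend(submissions)
        return selected

A caller would supply fetch_signature, for example a function that fetches the result page for a given submission and returns a content fingerprint; empty or identical result pages then collapse to a single signature and the template is rejected.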
Third, our solutions must scale and be domain independent. There are millions of potentially useful forms on the web. Given a particular form, it might be possible for a human expert to determine through laborious analysis the best possible submissions for that form, but such a solution would not scale. Our goal was to find a completely automated solution that can be applied to any web form in any language or domain. To date, our system has crawled over a million forms in over 50 languages and in hundreds of domains.

We note that we only index informational form sites. We take precautions to avoid any form that requires any personal information or is likely to have side effects. For example, we do not analyze forms that use the post method, have password or textarea inputs, or include keywords such as username, login, etc.

While our surfacing approach has generated considerable traffic, there remains a large number of forms that continue to present a significant challenge to automatic analysis. For example, many forms invoke JavaScript events in onselect and onsubmit tags that enable the execution of arbitrary JavaScript code, a stumbling block to automatic analysis. Further, many forms involve inter-related inputs, and accessing those sites involves correctly (and automatically) identifying the underlying dependencies. Addressing these and other such challenges efficiently on the scale of millions of forms is part of our continuing effort to make the contents of the Deep Web more accessible to search engine users. Finally, we also note that site maps are another mechanism that allows content providers to give lists of URLs in XML files to search engines, and thereby expose content from the Deep Web. All major search engines today support the site maps protocol described at www.sitemaps.org. The content provided by site maps tends to be complementary to the content that is automatically crawled using the techniques described above.

3. SEARCHING HTML TABLES

Michael Cafarella and Alon Halevy

The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Some of these tables contain relational-style data, with tuple-oriented rows and a schema that is implicit but often obvious to a human observer. Indeed, these HTML tables make up the largest corpus of relational databases that we are aware of (numbering more than 100M databases, with more than 2M unique schemas). The WebTables project is an effort to extract high-quality relations from the raw HTML tables, to make these databases queryable, and to use this unique dataset to build novel database applications.

The first WebTables task is to recover high-quality relations from a general web crawl. Relation recovery includes two steps: first, WebTables filters out raw HTML tables that do not carry relational data (such as those used for page layout). Second, for those tables that pass the relational filter, WebTables recovers schema metadata such as column types and labels. There is no way for an HTML author to reliably indicate whether a table is relational, or to formally declare relational metadata. Instead, WebTables must rely on a host of implicit hints. For example, tables that are used for page layout will often contain very long and text-heavy cells, whereas tables for relational data will usually contain shorter cells of roughly consistent length. Similarly, one way to test whether a table author has inserted a "header row" is to see if the first row is all strings, with different types in the remaining rows.
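These hints are easy to approximate in code. The sketch below gives toy versions of the two checks: a cell-length based layout filter and the "first row is all strings, later rows are typed" header test. The thresholds and the simple numeric-type test are our own illustrative choices; the actual WebTables filters are richer than this.

    import statistics
    from typing import List

    def is_layout_table(table: List[List[str]]) -> bool:
        """Rough filter: layout tables tend to have long, text-heavy cells of wildly
        varying length; relational tables have short cells of consistent length."""
        cells = [c for row in table for c in row if c.strip()]
        if len(cells) < 4:
            return True
        lengths = [len(c) for c in cells]
        return statistics.mean(lengths) > 100 or statistics.pstdev(lengths) > 80

    def _is_number(cell: str) -> bool:
        try:
            float(cell.replace(",", ""))
            return True
        except ValueError:
            return False

    def has_header_row(table: List[List[str]]) -> bool:
        """Heuristic from the text: the first row is a header if it is all strings
        while the same columns in the remaining rows hold a different (numeric) type."""
        if len(table) < 2 or not table[0]:
            return False
        first, rest = table[0], table[1:]
        if any(_is_number(cell) for cell in first):
            return False
        for col, _ in enumerate(first):
            vals = [row[col] for row in rest if col < len(row) and row[col].strip()]
            if vals and sum(_is_number(v) for v in vals) / len(vals) > 0.5:
                return True  # at least one column changes type below the first row
        return False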
The second WebTables challenge is to design a query tool that gives easy access to more than a hundred million unique databases with more than 2 million unique schemas. We have built a "structured data search engine" in which the user types a search-style text query, and the engine returns a relevance-ranked list of databases instead of a list of URLs. After the user has chosen a relevant database, she can apply more traditional structured query tools (such as selection, projection, etc.). Additionally, the engine can automatically apply certain structured operations without waiting for the user. For example, WebTables can examine the contents of a table and try to generate a visualization that is domain-appropriate and "interesting," displaying it next to the table in query search results.

Finally, WebTables uses the corpus of recovered databases to build a series of new applications. We have designed two so far. The first is schema autocomplete, in which a user enters one or more desirable data attributes (e.g., "name") and the autocompletor suggests the rest (e.g., "address", "city", "zip", "phone", etc.). The second is synonym finding, a tool that automatically computes which table attributes appear to be synonymous (e.g., "song" and "title", "telephone" and "tel-#"). This data can then be used to improve schema matching. Both tools are made possible by attribute-label co-occurrence statistics derived from the corpus of recovered databases.
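As a rough illustration of how such co-occurrence statistics can drive schema autocomplete, the sketch below counts how often attribute labels appear together across recovered schemas and suggests the labels that co-occur most strongly with a seed set; synonym finding can be built on the same counts. The scoring formula is a deliberately simple stand-in for whatever ranking the real system uses.

    from collections import Counter, defaultdict
    from itertools import combinations
    from typing import Dict, Iterable, List, Tuple

    def attribute_cooccurrence(schemas: Iterable[List[str]]) -> Tuple[Counter, Counter]:
        """Count single-attribute and pairwise co-occurrence frequencies over a corpus of schemas."""
        attr_counts: Counter = Counter()
        pair_counts: Counter = Counter()
        for schema in schemas:
            labels = sorted({a.strip().lower() for a in schema})
            attr_counts.update(labels)
            pair_counts.update(combinations(labels, 2))
        return attr_counts, pair_counts

    def autocomplete(seed: List[str], attr_counts: Counter, pair_counts: Counter, k: int = 5) -> List[str]:
        """Suggest the k attributes that co-occur most strongly with the seed attributes."""
        seed_set = {s.strip().lower() for s in seed}
        scores: Dict[str, float] = defaultdict(float)
        for (a, b), n in pair_counts.items():
            if a in seed_set and b not in seed_set:
                scores[b] += n / attr_counts[a]
            elif b in seed_set and a not in seed_set:
                scores[a] += n / attr_counts[b]
        return [a for a, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

For example, over a corpus in which "name" frequently appears alongside "address", "city", and "zip", autocomplete(["name"], ...) would surface those labels first.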

WebTables works today (available only internally), but we believe there are many future research questions. We can improve the performance of existing steps (such as relation recovery accuracy and database ranking quality), expand the input data beyond simple HTML tables (perhaps including HTML lists or Excel spreadsheets), and build new applications on the recovered data (such as data suggestion, a "vertical" analog to schema autocompletion). There is also a host of questions prompted by a "data-centric" view of the web: we are currently researching whether it is possible to automatically find joins between structured data recovered from different web pages. For example, to find where world leaders reside, we might join a table of countries and capital cities to a table of countries and their premiers.

The WebTables Project and the Deep-Web Crawl Project are parts of our larger research effort into dataspaces [6], and on data integration with uncertainty as a basis for building dataspace systems. Some of our earlier work in this area is described in [4, 5, 8].

4. LARGE-SCALE DATA MINING AND COMMUNITY PRODUCTS

Edward Chang

We now describe our work on developing scalable algorithms for mining large-scale Web data and social graphs. This work is led by Edward Chang, who heads Google Research in China. Building upon this scalable data mining infrastructure, the engineering team developed and launched two social-network products, and drastically reduced the page-rank spam rate in China (from 5% in 2006 to now under 1%).

The research work focused on parallelizing six mission-critical machine learning algorithms, including Support Vector Machines (SVMs), Singular Value Decomposition (SVD), Spectral Clustering, Association Mining, Probabilistic Latent Semantic Analysis (PLSA), and Latent Dirichlet Allocation (LDA), to take advantage of Google's massive, distributed storage and computing services. In particular, his team parallelized SVMs [1], and made the code publicly available through Apache open source.

SVMs are widely used for classification tasks due to their strong theoretical foundation and empirical successes. Unfortunately, SVMs suffer from scalability problems in memory use and computational time. We developed a parallel SVM algorithm (PSVM) to remedy these problems. PSVM reduces memory use by performing a row-based approximate matrix factorization, and by loading only essential data onto each of the parallel machines. PSVM reduces computation time by intelligently reordering computation sequences and by performing them on parallel machines. Furthermore, PSVM supports fault-tolerant computing to recover from computer-node failures.

In terms of computational complexity, let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory required by the Interior Point Method (IPM) from O(n²) to O(np/m), and improves computation time to O(np²/m). For instance, a task taking 7 days to run on a single machine takes PSVM two hours to complete on 200 machines.
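To make the scaling claim concrete, here is a back-of-the-envelope comparison under purely hypothetical sizes (the article gives no concrete n, p, or m for a specific workload); it only illustrates the gap between O(n²) and O(np/m) memory.

    # Hypothetical sizes, for illustration only (not figures from the article).
    n = 1_000_000    # training instances
    p = 1_000        # reduced dimension after row-based factorization (p << n)
    m = 200          # machines

    full_kernel_entries = n * n              # classic IPM working set: O(n^2)
    psvm_entries_per_machine = n * p // m    # PSVM working set per machine: O(np/m)

    print(f"{full_kernel_entries:,}")        # 1,000,000,000,000 entries
    print(f"{psvm_entries_per_machine:,}")   # 5,000,000 entries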
PSVM is currently used internally at Google for identifying spammy and objectionable Web sites. Since PSVM was made publicly available, the code has been widely downloaded.

Besides PSVM, parallel versions of SVD, PLSA, and LDA have also been made available internally at Google. These algorithms are useful for classification and collaborative filtering tasks. For classification, PLSA is employed to provide tags for user questions, short messages, and user posts. For collaborative filtering, PLSA and LDA are used to assist various recommendation features, e.g., friend/expert suggestion, forum recommendation, and ads matching. Together, these algorithms power two products, which we describe next.

The first product is Knowledge Search, which was first launched in Russia and then China [12], and is now being launched in several other countries. Knowledge Search allows users to post questions and then matches experts to answer those questions in a timely manner. The distinguishing feature of this product compared to competing products is that it offers online question classification, related-question recommendation, and topic-sensitive expert matching. All of these features are powered by the aforementioned machine-learning infrastructure.

The second product, Laiba, is a social-network product initially developed based on Orkut. We first localized the product, and then quickly expanded its features, adding photo sharing and user-interaction services. We launched Laiba in China in 2007 [10]. Similar to Knowledge Search, this product is supported by large-scale data mining algorithms for friend/community/content recommendation. The team is now further expanding Laiba to support the Google OpenSocial platform [11], which will enable third-party applications to plug into Laiba and other social networks.

5. BIGTABLE

Andrew Fikes and Wilson Hsieh

We have built a system called Bigtable [2] to store structured or semi-structured data at Google. (The Google File System [7] is used when a file-system interface is acceptable.) Bigtable can be viewed at a systems level as a distributed, non-relational database; at the algorithmic level as a highly distributed multi-level map; or at the implementation level as a variant of a distributed B-tree. We use Bigtable to store data for many different projects, such as web indexing, Google Earth, and Orkut.

Bigtable has been under active development since late 2003, and its first deployment in production was in mid-2005. Over the last few years, the deployment of Bigtable has grown steadily. As of January 2008, there are more than 600 Bigtable clusters at Google; the largest cluster has over 2000 machines. The largest cells store over 700 TB of data, and the busiest cells sustain 100K operations/second.

Besides our day-to-day "maintenance" work (improving the performance of the system, fixing bugs, training users, writing documentation, improving manageability, etc.), we are still adding new features to Bigtable. In addition, we continue to redesign parts of the system as our users run larger Bigtable clusters. Following is a brief description of some of the higher-level issues that we are working on:

• Coping with failure. Bigtable software runs on lots of machines: enough to almost guarantee that we will eventually run into buggy hardware (faulty memory is one of the more problematic issues). We are investigating better ways to deal with such problems without hurting performance dramatically.

• Sharing machines across users. Although Bigtable's interface supports multiple users, the implementation did not until recently do a good job of providing sufficient isolation between them. We are still improving Bigtable's ability to do resource management and isolation.

• Cross-data-center replication. Bigtable currently allows clients to set up lazy replication between their tables (which can be in different data centers). This replication system guarantees eventual consistency, which suffices for many of our clients. However, some clients (in particular, those that are building user-facing products) need stronger guarantees on the consistency and availability of their data. We are building in support for these stronger consistency guarantees, both on top of Bigtable and inside Bigtable.

• Attaching computation to data. To support long-running computations that need to access data in Bigtable, we have been adding APIs that allow clients to run code on the same machines as their data. Although the Map-Reduce framework [3] does provide some support for running computation near data, it does not provide any strong guarantees.

• More expressive queries. Bigtable does not support SQL; it currently supports the use of a Google-designed language called Sawzall [9] to describe server-side filtering of data. For various reasons, this support is awkward to use, and requires a fair amount of work to describe simple filters. We are in the process of implementing a small language that will support the most common kinds of filters that our clients need.

• Direct support for indexing. Many of our clients want to store indexed data in Bigtable. Currently, they have to manage the indices themselves. We are in the process of building support for indices directly into Bigtable.

Most of these features are being added directly to Bigtable, but some features are being built as client layers on top of Bigtable. The Megastore project, for example, is building more general support for transactions, consistent replication, and DAO. Although Bigtable is not a database, most of the features that we are adding are very familiar to the database community. That fact is unsurprising, given the usefulness of these features. What will be interesting is what the design and implementation of Bigtable is in 1-2 years, and what it tells us about building high-performance, widely distributed data-storage systems.

6. MINITABLES: SAMPLING BIGTABLE

Alberto Lerner and S. Muthukrishnan

As described above, Bigtable is a high-performance, distributed, row-storage system that is highly scalable, but it is not meant to provide relational query processing or sophisticated indexing. Therefore, some accesses to a Bigtable may involve large parallel scans. Although Google's infrastructure supports these scans relatively well, there are instances where it is desirable to work with a sample of the data in a Bigtable. This section discusses the challenges and opportunities in building such a sampling feature.

A row in a Bigtable is keyed by a unique string called a rowname, and each row has its data spread across a number of column families. A column family may comprise a variable number of actual columns. Since Bigtables are sparse structures, a row may or may not exist for a given query, depending on which columns that query requested. Data is maintained in lexicographical order, but different columns may or may not be stored apart. Because of these semantics and this storage scheme, skipping N rows is not feasible without actually reading them. Even finding the count of rows in a Bigtable at any point in time can be done only probabilistically. On the bright side, since Bigtable does not provide a relational query engine, we do not need to consider what are suitable sampling methods for various relational operators (like joins), or take into account how sampling errors compound with increasing levels of query composition.
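As a mental model for the paragraphs that follow, the sketch below is a toy, in-memory caricature of the row/column-family layout just described (it is not Bigtable's actual client API). It also shows why skipping N rows or counting rows requires walking the keys.

    from typing import Dict, Iterator, Optional, Tuple

    Row = Dict[str, Dict[str, bytes]]  # column family -> qualifier -> value

    class TinyTable:
        """Toy model of a sparse table keyed by rowname (illustrative only)."""

        def __init__(self) -> None:
            self._rows: Dict[str, Row] = {}  # rowname -> row

        def put(self, rowname: str, family: str, qualifier: str, value: bytes) -> None:
            self._rows.setdefault(rowname, {}).setdefault(family, {})[qualifier] = value

        def get(self, rowname: str, family: str, qualifier: str) -> Optional[bytes]:
            # The table is sparse: a row "exists" only for the columns it actually holds.
            return self._rows.get(rowname, {}).get(family, {}).get(qualifier)

        def scan(self, start: str = "", end: str = "\uffff") -> Iterator[Tuple[str, Row]]:
            # Rows come back in lexicographic rowname order; there is no way to
            # jump to "the Nth row" or to count rows without iterating.
            for rowname in sorted(self._rows):
                if start <= rowname < end:
                    yield rowname, self._rows[rowname]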
Uniform Random Sampling. Our sampling scheme extracts and presents a sample of a Bigtable's rows as if it were a Bigtable itself, which we call a Minitable. The rationale here is that code written to run against a Bigtable can run unchanged against a sample thereof.

Our sampling is based on a hash scheme. We pick a convenient hash function that maps the rowname space into a very large keyspace (e.g., an ax + b mod p function, where p is as large as 2^128). The rows falling into the first fp keys, where f is the relative sample size (a fraction), belong in the sample. Formally, we pick a hash function h : R → [0..p], and if h(r) ∈ [0, fp − 1], then we add r to the sample. It is easy to see that the expected size of the sample is f × 100% of the Bigtable rowcount, independent of the rowcount, and that the probability that a particular row r is in the sample is f, as desired. This hash-based sampling method supports maintenance of the sample with each Bigtable mutation (insert, update, or deletion).

The system need only forward the relevant mutations from the Bigtable to the Minitable. In all other respects, the latter behaves just like any other Bigtable: it can be backed up and even replicated. We are currently deploying Minitables in the repository of documents that the crawling system generates. Several Minitables, each with a different sample factor, allow that system to compute aggregates much faster and more often.
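The membership test above translates almost directly into code. In the sketch below, the linear hash constants, the SHA-1 fingerprinting of the rowname, and the dict-based Minitable stand-in are our own illustrative choices rather than details of the production system.

    import hashlib
    from typing import Any, Dict

    P = 2 ** 128             # size of the hash keyspace
    A = 0x9E3779B97F4A7C15   # arbitrary constants for the linear hash (illustrative)
    B = 0x7F4A7C159E3779B9

    def h(rowname: str) -> int:
        """Map a rowname into [0, P) with an a*x + b mod p style hash."""
        x = int.from_bytes(hashlib.sha1(rowname.encode("utf-8")).digest(), "big")
        return (A * x + B) % P

    def in_sample(rowname: str, f: float) -> bool:
        """A row belongs to the Minitable iff its hash lands in the first f fraction of the keyspace."""
        return h(rowname) < int(f * P)

    def forward_mutation(minitable: Dict[str, Any], rowname: str, row: Any, deleted: bool, f: float) -> None:
        """Maintain the sample incrementally: only mutations of sampled rows reach the Minitable."""
        if not in_sample(rowname, f):
            return
        if deleted:
            minitable.pop(rowname, None)
        else:
            minitable[rowname] = row

With f = 0.01, about 1% of rows pass the test, and the same rowname always hashes the same way, so the Minitable stays consistent with the base table as mutations arrive.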

Biased Sampling. Uniform random sampling is quite useful, but some scenarios require biased sampling methods. We are currently working on one such extension, which we call Mask Sampling. In this scheme, the decision to select a row for the sample is still based on its rowname, but now a user may specify a mask m over it. The mask, which can be a regular expression that matches portions of a rowname, is used to group rows together. Two rows belong to the same group if their masks result in the same string. Mask sampling guarantees that if a group is selected for the sample, that group will be adequately represented there, regardless of that group's relative size.

A typical application would be over a Bigtable that stores web pages keyed by their URLs. If one used uniform random sampling over it, one may lose information about domains with relatively few pages. With mask sampling, one can define how to extract a domain from a URL and determine that each domain has the same probability of appearing in the sample.

Specifically, our procedure should return a possibly non-uniform sample of ⟨k.m⟩, that is, rowname k projected only on the mask. There are at least two details that make the problem interesting. (a) The set of distinct ⟨k.m⟩'s may be large and may itself need to be sampled. Using our previous example, there may simply be too many domains to fit in the desired sample size. (b) Even though the rownames are unique, the set of ⟨k.m⟩'s is often not: for each ⟨k.m⟩ value, we have a set of rows from the Bigtable and we need to determine what to retain in the Minitable. Again, using the example URL table, we may need to sample within a chosen domain. Let us consider the set S(n) of rows that have ⟨k.m⟩ = n. Then, ideally, we would like to keep all rows r from S(n) if |S(n)| is small, to sub-sample with moderate probability if |S(n)| is larger, and more aggressively when |S(n)| is huge.

The hash-sampling procedure generalizes to the biased case as well. We have h1 : ⟨k.m⟩ → [0..p] and retain those that hash into the first f fraction of the range, as before. Then, within each ⟨k.m⟩ = n that is retained by h1, we apply h2 (dependent on n), this time on the entire rowname as opposed to just the mask. Here, we assume that we have a side table T : ⟨k.m⟩ = n → g_n, which is often programmed by an offline logic or is easy to maintain in a lazy manner in practice. (It can be indirectly obtained if a uniform Minitable is present.) We call this table the gtable because it contains a row for each group-by specified by the mask.

In practice, this sampling scheme may give us a biased Minitable from the URL Bigtable with a representative sample of domains. Each of them would carry enough rows to allow for the computation of approximate aggregations, for instance, even if the domains chosen had a large variance in terms of number of rows in the base Bigtable.
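To make the two-level scheme concrete, here is a small sketch. h1 decides, from the mask projection ⟨k.m⟩ alone, whether a whole group enters the sample; h2, salted by the group, then sub-samples rownames within a kept group using a per-group rate g_n looked up from the gtable. The regular-expression masking, the SHA-1-based hashes, and the defaulting of g_n to 1.0 are our own illustrative choices, not the production design.

    import hashlib
    import re
    from typing import Dict

    P = 2 ** 128

    def _hash(value: str, salt: str) -> int:
        # Independent hash functions are simulated by salting one cryptographic hash.
        return int.from_bytes(hashlib.sha1((salt + value).encode("utf-8")).digest(), "big") % P

    def mask_projection(rowname: str, mask: str) -> str:
        """<k.m>: the portion of the rowname matched by the mask (the group key)."""
        m = re.search(mask, rowname)
        return m.group(0) if m else ""

    def in_biased_sample(rowname: str, mask: str, f: float, gtable: Dict[str, float]) -> bool:
        group = mask_projection(rowname, mask)
        # Level 1 (h1): keep a fraction f of the groups, hashing the mask projection only.
        if _hash(group, "h1") >= int(f * P):
            return False
        # Level 2 (h2, dependent on the group): within a kept group, keep each row
        # with the group's rate g_n (near 1 for small groups, smaller for huge ones).
        g_n = gtable.get(group, 1.0)
        return _hash(rowname, "h2:" + group) < int(g_n * P)

For the URL example in the text, a mask such as r"^https?://[^/]+" groups rows by domain, so a rarely crawled domain that survives h1 keeps essentially all of its rows, while a huge domain is thinned out by its g_n.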

7. REFERENCES

[1] Chang, E. Y., Zhu, K., et al. Parallelizing Support Vector Machines on Distributed Computers. In Proc. of NIPS 2007. Open-source code available at http://code.google.com/p/psvm/.
[2] Chang, F., et al. Bigtable: A Distributed Storage System for Structured Data. In Proc. of the 7th OSDI (Dec. 2006), pp. 205–218.
[3] Dean, J., and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. In Proc. of the 6th OSDI (Dec. 2004), pp. 137–150.
[4] Dong, X., and Halevy, A. Indexing Dataspaces. In Proc. of SIGMOD 2007, pp. 43–54.
[5] Dong, X., Halevy, A., and Yu, C. Data Integration with Uncertainty. In Proc. of VLDB 2007, pp. 687–698.
[6] Franklin, M., Halevy, A., and Maier, D. From Databases to Dataspaces: A New Abstraction for Information Management. SIGMOD Record 34(4), pp. 27–33, 2005.
[7] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google File System. In Proc. of the 19th ACM SOSP (Dec. 2003), pp. 29–43.
[8] Madhavan, J., Cohen, S., Dong, X., Halevy, A., Jeffery, S., Ko, D., and Yu, C. Web-Scale Data Integration: You Can Only Afford to Pay As You Go. In Proc. of CIDR 2007, pp. 342–350.
[9] Pike, R., Dorward, S., Griesemer, R., and Quinlan, S. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming Journal 13, 4 (2005), pp. 227–298.
[10] Laiba. http://laiba.tianya.cn.
[11] Google OpenSocial. http://code.google.com/apis/opensocial.
[12] Knowledge Search. http://otvety.google.ru/otvety/ and http://wenda.tianya.cn.
