Harnessing the Deep Web: Present and Future

Jayant Madhavan (Google Inc.)
Loredana Afanasiev (Universiteit van Amsterdam)
Lyublena Antova (Cornell University)
Alon Halevy (Google Inc.)

1. INTRODUCTION

The Deep Web refers to content hidden behind HTML forms. In order to get to such content, a user has to perform a form submission with valid input values. The name Deep Web arises from the fact that such content was thought to be beyond the reach of search engines. The Deep Web is also believed to be the biggest source of structured data on the Web and hence accessing its contents has been a long standing challenge in the data management community [1, 8, 9, 13, 14, 18, 19].

Over the past few years, we have built a system that exposed content from the Deep Web to web-search users of Google.com.

[...] pre-compute queries to forms and inserts the resulting pages into a web-search index. These pages are then treated like any other page in the index and appear in answers to web-search queries. We have pursued both approaches in our work. In Section 3 we explain our experience with both, and where each approach provides value.

We argue that the value of the virtual integration approach is in constructing vertical search engines in specific domains. It is especially useful when it is not enough to just focus on retrieving data from the underlying sources, but when users expect a deeper experience with the content (e.g., making purchases) after they found what they