Logic-Based Web Information Extraction∗
Georg Gottlob and Christoph Koch
Database and Artificial Intelligence Group, Technische Universität Wien, A-1040 Vienna, Austria.
fgottlob, [email protected]

∗ Database Principles Column. Column editor: Leonid Libkin, Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3H5, Canada. E-mail: [email protected].

1 Introduction

The Web wrapping problem, i.e., the problem of extracting structured information from HTML documents, is one of great practical importance. The often-observed information overload that users of the Web experience witnesses the lack of intelligent and encompassing Web services that provide high-quality collected and value-added information. The Web wrapping problem has been addressed by a significant amount of research work. Previous work can be classified into two categories, depending on whether the HTML input is regarded as a sequential character string (e.g., [34, 27, 24, 30, 23]) or a pre-parsed document tree (for instance, [35, 25, 22, 29, 3, 2, 26]). The latter category of work thus assumes that systems may make use of an existing HTML parser as a front end.

Robust wrappers are easier to program using a wrapper programming language that models documents as pre-parsed document trees rather than as text strings. Writing a fully standards-compliant HTML parser is a substantial task, which should not have to be redone from scratch for each wrapper being created. The use of an existing parser allows the wrapper implementor to focus on the essentials of each wrapping task and to work on a higher, more user-friendly level.

Simplifications such as this one are important, since Web service designers often face the task of wrapping a large number of Web sites. In order to provide a useful Web service, the information from a significant number of source sites relevant to the domain of the service has to be integrated and made accessible in a uniform manner. Otherwise, a Web service may fail to attract the acceptance of the users it is intended for. Moreover, Web page layouts may be subject to frequent change. This is often intentional, to discourage screen-scraping wrapper access and to force humans to personally visit the sites. In such a case wrapper programs may need to be adapted.

These are just two reasons for which wrapping tools need to assist humans to render the creation of wrappers a more manageable task. Two ways to approach this requirement have been proposed: the use of machine learning techniques to create wrappers automatically from annotated examples (e.g. [23, 30]) and the visual specification of wrappers. The first approach currently suffers from the need to provide machine learning algorithms with too many example instances (which have to be wrapped manually) and from negative theoretical results that put a bound on the expressive power of learnable wrappers.¹

¹ For example, it is known that even regular string languages cannot be learned from positive examples only [12].

By the second candidate² for a substantial productivity leap, the visual specification of wrappers, we ideally mean the process of interactively defining a wrapper from one (or few) example document(s) using mainly "mouse clicks", supported by a strong and intuitive design metaphor. During this visual process, the wrapper program should be automatically generated and should not actually require the human designer to know the wrapper programming language. Visual wrapping is now a reality supported by several implemented systems (cf. XWrap [25], W4F [35], and Lixto [2]), however with varying thoroughness.

² A promising direction of future work on Web wrapping may be to combine visual specification with machine learning techniques, in order to make the visual specification process more "intelligent" and at the same time guide the learning process to reduce the number of examples needed to learn a wrapper.

One may thus want to look for a wrapping language over HTML document trees that

(i) has a solid theoretical foundation,
(ii) provides a good trade-off between complexity and the number of practical wrappers that can be expressed,
(iii) is easy to use as a wrapper programming language, and
(iv) is suitable for being incorporated into visual tools, since ideally all constructs of a wrapping language can be realized through corresponding visual primitives.

Clearly, languages which do not have the right expressive power and computational properties cannot be considered satisfactory, even if wrappers are easy to define. A few words on the "right expressiveness" of a wrapper programming language are in order here. Throughout the literature, the scope of wrapping is a conceptually limited one (see e.g. [11, 20]). Information systems architectures that employ wrapping usually consist of at least two layers, a lower one that is restricted to extracting relevant data from data sources and making them available in a coherent representation using the data model supported by the higher layer, and a higher layer in which data transformation and integration tasks are performed which are necessary to fuse syntactically coherent data from distinct sources in a semantically coherent manner. With the term wrapping we refer to the lower, syntactic integration layer. The higher, semantic integration layer will not be further considered in this article. Therefore, a wrapper is assumed to extract relevant data from a possibly poorly structured source and to put it into the desired representation formalism by applying a number of transformational changes close to the minimum possible. A wrapping language that permits arbitrary data transformations may be considered overkill.

In this article, we provide a survey of the (logic-based) approach to visual wrapping that has been introduced in [2, 3, 16] and has been implemented in a commercial software product, the Lixto Visual Wrapper [26]. The core notion that this wrapping approach is based on is that of an information extraction function, which takes a labeled unranked tree (representing a Web document) and returns a subset of its nodes. A wrapper is a program which implements one or several such functions, and thereby assigns unary predicates to document tree nodes. Based on these predicate assignments and the structure of the input document viewed as a tree, a new tree can be computed as the result of the information extraction process in a natural way, along the lines of the input tree but using the new labels and omitting nodes that have not been relabeled. (See Figure 1 for an example of such a transformation.) That way, we can take a tree, re-label its nodes, and declare some of them as irrelevant, but we cannot significantly transform its original structure. This coincides with the intuition that a wrapper may change the presentation of relevant information, its packaging or data model (which does not apply in the case of Web wrapping), but does not handle substantial data transformation tasks. We believe that this captures the essence of wrapping.

Figure 1: Tree annotated with predicates P, Q, and R defined by information extraction functions (a), and wrapping result (b). (Original node labels of the input tree are not shown.)

We assume unary queries in monadic second-order logic (MSO) over trees as our expressiveness yardstick for information extraction functions. MSO over trees is well-understood theory-wise [38, 7, 5, 8] (see also [39]) and quite expressive. In fact, it is considered by many as the language of choice for defining expressive node-selecting queries on trees (see e.g. [33, 32, 16, 21]; [36] acknowledges the role of MSO but argues for even stronger languages). In our experience, when considering a wrapping system that lacks this expressive power, it is usually quite easy to find real-life wrapping problems that cannot be handled (see also the related discussion on MSO expressiveness and node-selecting queries in [21]).

In this article, we discuss monadic datalog over trees, a simple form of the logic-based language datalog, as a wrapper programming language. As argued for in detail in [16], monadic datalog is the first and currently only language which satisfies desiderata (i) to (iv) raised above.

The article is structured as follows. We start with preliminaries regarding trees and MSO in Section 2 and introduce monadic datalog in Section 3. In Section 4, we discuss the complexity of monadic datalog over trees and its expressive power. Interestingly, monadic datalog is equivalent to MSO in its ability to express unary queries on trees. Monadic datalog can be evaluated in linear time both in the size of the data and the query, given that tree structures are appropriately represented. We also define a simple normal form for monadic datalog over trees, TMNF, to which any monadic datalog program over trees can be mapped in linear time. In Section 5, we discuss monadic datalog as a wrapper programming language. Finally, in Section 6, we look at the specification of wrappers based on monadic datalog without requiring the wrapper implementor to deal with or know datalog, by an entirely visual process that works on example Web documents. We conclude with Section 7.

2 Tree Structures

Trees are defined in the normal way and have at least one node. We assume that the children of each node are in some fixed order. Each node has a label taken from a finite nonempty set of symbols Σ, the alphabet.³

³ The finite alphabet choice is discussed in more detail below, in Remark 2.1.

We consider only unranked finite trees, which cor-