Web Data Extraction, Applications and Techniques: A Survey

Emilio Ferrara (a,*), Pasquale De Meo (b), Giacomo Fiumara (c), Robert Baumgartner (d)

a Center for Complex Networks and Systems Research, Indiana University, Bloomington, IN 47408, USA
b Univ. of Messina, Dept. of Ancient and Modern Civilization, Polo Annunziata, I-98166 Messina, Italy
c Univ. of Messina, Dept. of Mathematics and Informatics, viale F. Stagno D'Alcontres 31, I-98166 Messina, Italy
d Lixto Software GmbH, Austria

Abstract

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.

Keywords: Web Information Extraction, Web Data Mining, Business Intelligence, Knowledge Engineering, Knowledge-based Systems, Information Retrieval

* Corresponding author
Email addresses: [email protected] (Emilio Ferrara), [email protected] (Pasquale De Meo), [email protected] (Giacomo Fiumara), [email protected] (Robert Baumgartner)

Preprint submitted to Knowledge-Based Systems, June 5, 2014

Contents

1 Introduction
  1.1 Challenges of Web Data Extraction techniques
  1.2 Related work
  1.3 Our contribution
  1.4 Organization of the survey
2 Techniques
  2.1 Tree-based techniques
    2.1.1 Addressing elements in the document tree: XPath
    2.1.2 Tree edit distance matching algorithms
  2.2 Web wrappers
    2.2.1 Wrapper generation and execution
    2.2.2 The problem of wrapper maintenance
  2.3 Hybrid systems: learning-based wrapper generation
3 Web Data Extraction Systems
  3.1 The main phases associated with a Web Data Extraction System
  3.2 Layer cake comparisons
4 Applications
  4.1 Enterprise Applications
    4.1.1 Context-aware advertising
    4.1.2 Customer care
    4.1.3 Database building
    4.1.4 Software Engineering
    4.1.5 Business Intelligence and Competitive Intelligence
    4.1.6 Web process integration and channel management
    4.1.7 Functional Web application testing
    4.1.8 Comparison shopping
    4.1.9 Mashup scenarios
    4.1.10 Opinion mining
    4.1.11 Citation databases
    4.1.12 Web accessibility
    4.1.13 Main content extraction
    4.1.14 Web (experience) archiving
    4.1.15 Summary
  4.2 Social Web Applications
    4.2.1 Extracting data from a single Online Social Web platform
    4.2.2 Extracting data from multiple Online Social Web platforms
  4.3 Opportunities for cross-fertilization
5 Conclusions

1. Introduction

Web Data Extraction systems are a broad class of software applications aimed at extracting information from Web sources [79, 11]. A Web Data Extraction system usually interacts with a Web source and extracts data stored in it: for instance, if the source is an HTML Web page, the extracted information could consist of elements of the page as well as the full text of the page itself. The extracted data might then be post-processed, converted into the most convenient structured format and stored for further use [131, 63].
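To make this concrete, the following minimal Python sketch (not taken from any system discussed in this survey; the page content and class name are purely illustrative) extracts both structured elements (hyperlinks) and the full text from an HTML page, mirroring the two kinds of output mentioned above:

```python
# A minimal sketch of the extraction step described above, using only the
# Python standard library. The page content is a hypothetical example.
from html.parser import HTMLParser

class SimpleExtractor(HTMLParser):
    """Collects anchor targets (structured elements) and the page's text."""
    def __init__(self):
        super().__init__()
        self.links = []        # structured data: elements of the page
        self.text_parts = []   # unstructured data: the full text

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

page = "<html><body><h1>Products</h1><a href='/item/1'>Widget</a></body></html>"
extractor = SimpleExtractor()
extractor.feed(page)
print(extractor.links)                 # ['/item/1']
print(" ".join(extractor.text_parts))  # 'Products Widget'
```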
Web Data Extraction systems find extensive use in a wide range of applications, including the analysis of text-based documents available to a company (e-mails, support forums, technical and legal documentation, and so on), Business and Competitive Intelligence [9], crawling of Social Web platforms [17, 52], and Bio-Informatics [99]. The importance of Web Data Extraction systems stems from the fact that a large (and steadily growing) amount of information is continuously produced, shared and consumed online: Web Data Extraction systems make it possible to collect this information efficiently and with limited human effort. The availability and analysis of collected data is an essential requirement for understanding the complex social, scientific and economic phenomena which generate the information itself. For example, collecting the digital traces produced by users of Social Web platforms like Facebook, YouTube or Flickr is the key step to understanding, modeling and predicting human behavior [68, 94, 3].

In the commercial field, the Web provides a wealth of public-domain information. A company can probe the Web to acquire and analyze information about the activity of its competitors. This process, known as Competitive Intelligence [22, 125], is crucial for quickly identifying the opportunities provided by the market, anticipating the decisions of competitors, and learning from their faults and successes.

1.1. Challenges of Web Data Extraction techniques

The design and implementation of Web Data Extraction systems has been discussed from different perspectives, and it leverages scientific methods from various disciplines, including Machine Learning, Logic and Natural Language Processing.

In the design of a Web Data Extraction system, many factors must be taken into account. Some of them are independent of the specific application domain in which we plan to perform Web Data Extraction; other factors, instead, heavily depend on the particular features of the application domain. As a consequence, some technological solutions which appear to be effective in certain application contexts are not suitable in others. In its most general formulation, the problem of extracting data from the Web is hard because it is constrained by several requirements. The key challenges encountered in the design of a Web Data Extraction system can be summarized as follows:

• Web Data Extraction techniques often require the help of human experts. A first challenge consists of providing a high degree of automation by reducing human effort as much as possible. Human feedback, however, may play an important role in raising the level of accuracy achieved by a Web Data Extraction system. A related challenge is, therefore, to identify a reasonable trade-off between the need for highly automated Web Data Extraction procedures and the requirement of achieving accurate performance.

• Web Data Extraction techniques should be able to process large volumes of data in relatively short time. This requirement is particularly stringent in the field of Business and Competitive Intelligence, because a company needs to perform timely analyses of market conditions.

• Applications in the field of the Social Web or, more generally, those dealing with personal data must provide solid privacy guarantees. Therefore, potential (even if unintentional) attempts to violate user privacy should be promptly and adequately identified and counteracted.

• Approaches relying on Machine Learning often require a significantly large training set of manually labeled Web pages. In general, the task of labeling pages is time-consuming and error-prone, so in many cases we cannot assume the existence of labeled pages.

• Oftentimes, a Web Data Extraction tool has to routinely extract data from a Web source that evolves over time. Web sources change continuously, and structural changes happen with no forewarning, which makes them unpredictable. In real-world scenarios, therefore, the need emerges to maintain these systems, which might stop working correctly if they lack the flexibility to detect and handle structural modifications of the underlying Web sources; a minimal sketch of this fragility is given after this list.
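The last challenge can be illustrated with a toy example. The following Python sketch (a simplified illustration, not an actual wrapper system; the markup is assumed to be well-formed XML, and the rule and function names are invented) shows how a wrapper built around a fixed structural rule silently breaks when the source layout changes:

```python
# A toy illustration of the wrapper-maintenance problem: the extraction
# rule was induced for the old layout; after a site redesign renames the
# class attribute, the rule matches nothing, and only an explicit sanity
# check reveals that the wrapper needs maintenance.
import xml.etree.ElementTree as ET

RULE = ".//span[@class='price']"  # hypothetical rule for the old layout

old_page = "<html><body><span class='price'>19.99</span></body></html>"
new_page = "<html><body><span class='amount'>19.99</span></body></html>"

def run_wrapper(page, rule):
    root = ET.fromstring(page)
    return [el.text for el in root.findall(rule)]

for page in (old_page, new_page):
    values = run_wrapper(page, RULE)
    if values:
        print("extracted:", values)
    else:
        print("wrapper likely broken: rule matched nothing, maintenance needed")
```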
1.2. Related work

The theme of Web Data Extraction is covered by a number of reviews. Laender et al. [79] presented a survey that offers a rigorous taxonomy to classify Web Data Extraction systems; the authors introduced a set of criteria and a qualitative analysis of various Web Data Extraction tools. Kushmerick [77] defined a profile of finite-state approaches to the Web Data Extraction problem. The author analyzed both wrapper induction approaches (i.e., approaches capable of automatically generating wrappers by exploiting suitable examples) and maintenance ones (i.e., methods to update a wrapper each time the structure of the Web source changes). In that paper, Web Data Extraction techniques derived from Natural Language Processing and Hidden Markov Models were also discussed. On the wrapper induction problem, Flesca et al. [45] and Kaiser and Miksch [64] surveyed approaches, techniques and tools; the latter paper, in particular, provided a model describing the architecture of an Information Extraction system. Chang et al. [19] introduced a tri-dimensional categorization of Web Data Extraction systems, based on task difficulty, the techniques used and the degree of automation. In 2007, Fiumara [44] applied these criteria to classify four state-of-the-art Web Data Extraction systems. A relevant survey on Information Extraction is due to Sarawagi [105] and, in our opinion, anybody who intends to approach this discipline should read it. Recently, some authors have focused on unstructured data management systems (UDMSs) [36], i.e., software systems that analyze raw text data, extract some structure from them (e.g., person names and locations), integrate the structure (e.g., objects like New York and NYC are merged into a single object), and use the integrated structure to build a database.
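As a rough illustration of the UDMS pipeline just described (a minimal sketch under strong simplifying assumptions, not the system from [36]; the extraction rule and alias table are invented for the example), the steps of extracting structure from raw text, integrating equivalent objects, and building a database might look like:

```python
# A toy UDMS pipeline: extract structure from raw text, integrate
# equivalent objects (NYC -> New York), and collect records as a database.
import re

ALIASES = {"NYC": "New York"}  # hypothetical integration table

def extract(sentence):
    # Toy extraction rule for sentences of the form "<Person> lives in <Location>."
    m = re.match(r"(\w+) lives in ([\w ]+)\.", sentence)
    return {"person": m.group(1), "location": m.group(2)} if m else None

def integrate(record):
    # Merge aliases so that equivalent objects become a single object.
    record["location"] = ALIASES.get(record["location"], record["location"])
    return record

raw_text = ["Alice lives in NYC.", "Bob lives in New York."]
database = [integrate(r) for r in map(extract, raw_text) if r]
print(database)
# [{'person': 'Alice', 'location': 'New York'},
#  {'person': 'Bob', 'location': 'New York'}]
```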