Monitoring Multiple Web Pages and Presentation of Xml Pages
Total Page:16
File Type:pdf, Size:1020Kb
WEBVIGIL: MONITORING MULTIPLE WEB PAGES AND PRESENTATION OF XML PAGES Sharavan Chamakura, Alpa Sachde, Sharma Chakravarthy and Akshaya Arora Information Technology Laboratory Computer Science and Engineering Department The University of Texas, Arlington, TX 76019 {chamakura, sachde, sharma, aarora}@cse.uta.edu Abstract Web, and this data is increasing at a rapid pace. This plethora of information often leads to In the case of large-scale distributed situations wherein users looking for specific environments such as the Internet, users are pieces of information are flooded with irrelevant interested in monitoring changes to a particular data. For example, an investment banking web page (XML or HTML). There are many professional who is only interested in news instances in which the change detection is related to his areas of interest, is flooded with required at a finer granularity, such as changes news on many topics when he goes to any to links, images, phrases or keywords in a page. popular news website. In some cases users are WebVigiL is a general-purpose, change concerned in knowing about the updates monitoring, notification and presentation system happening to a particular web page of their for web pages. It handles specification, interest such as the keywords, links or other management, and propagation of customized change types. For example, students might want changes as requested by a user in an efficient to know about the updates that are made to the way. When the user specifies an URL, it may course websites regarding any new projects or contain frames instead of a single content page; assignments. There may be some situations this entails additional pages to be fetched from where users are interested in tracking changes the frames for detecting changes. In this paper across multiple web pages. For example, a user we propose different approaches for detecting of a discussion board related to technical topics changes over pages containing frames and might be interested in knowing about posts generalizing it in terms of depth. We also being made regarding a particular set of topics. address the case of monitoring multiple web In this case, the user wants to be notified of the pages for different change types with a single changes related to various topics present on request. As there are no presentation semantics different web pages. In general, the ability to and tags, manual identification of the detected specify changes to arbitrary documents and get changes is difficult on a XML web page. Once notified in different ways will be useful for detected, changes have to be notified to the reducing the inefficient navigation. user in a meaningful manner. In this paper we WebVigiL [1-3] is a change detection and also propose various presentation schemes to notification system developed to provide users display the changes computed on XML pages . with the ability to monitor web pages (XML/ HTML) according to a profile. Users’ place a request by giving a set of specifications like the Keywords page to be monitored, the type of change to be checked for, when to fetch the page and how to Change Detection, Multiple Web Pages, notify the changes. The user request is in the Presentation. form of a sentinel that is used for change detection and presentation. The changes are 1. Introduction detected and stored in the change repository. The stored changes are then notified to the user. The World Wide Web has outdone traditional Figure 1 summarizes the high level architecture media such as television, to become an of WebVigiL [1]. Information from the sentinel is indispensable source of information. There is extracted, validated and stored in a knowledge data related to each and every context on the base and is used by other modules in the system. The presentation module provides the additional attributes such as insert = delete to various presentation schemes for detected the original XML document. changes to present them in a well-organized and readable format. The fetch module is responsible for fetching the pages in an efficient/optimal manner for detecting changes. The version control module stores different versions of a page until it is no longer needed for detecting changes and for notification of changes. The change detection module interacts with various modules for detecting changes in an efficient manner and placing the results of changes in the knowledge for notification by the presentation/notification module. Many a times, users are interested in monitoring multiple URLs or even entire web sites. Users may place a sentinel on a page without realizing that it is a frame and not an ordinary page. In this case, we want the WebVigiL system to transparently process the Figure 1: System Architecture page with frames taking the user intent into account. Support for monitoring multiple pages Diff and Merge Tool [6] provided by IBM in general allows us to treat frames as a special compares two XML files based on node case of multiple pages. We use the notion of identification. It represents the differences depth to specify how many levels one needs to between the base and the modified XML files consider from the root. using a tree display in the left-hand, Merged This paper extends upon our previous work View pane with symbols and colours to highlight in customized content monitoring of web pages. the differences using the XPath syntax. The contributions of this paper are: a) the DOMMITT [7] is a UNIX diff utility tool improved Change Detection Graph (CDG) which that enables the users to view differences supports the monitoring of multiple URLs for between the DOM representations of two XML either similar or different change types. As a documents. The Diff algorithms on these DOM subset, this includes the monitoring of web [8] representations produces edit scripts, which pages containing frames, b) Presentation are merged into the first document to produce schemes for the semi structured XML pages. an XML document in which the edit operations The rest of the paper is organized as follows. are represented as insert = delete tags for a Section 2 discusses the overview of the related user interactive display of differences. work. Section 3 and 4 will discuss, respectively, AIDE presents a set of tools (collectively a mechanism to monitor multiple URLs, which called AT&T Internet Difference Engine) [9] that includes the class of web pages with frames and detect when pages have been modified and the schemes for presentation of detected present modifications to user through marked up changes for XML webpage. Section 5 provides HTML. AIDE uses HTML diff to graphically the conclusion and future work. present differences using heuristics to determine additions and deletions between versions of a 2. Related Work page. AIDE uses the three most popular schemes the dual frame scheme, merged view and only change scheme. All of the techniques Most of the research has been done in change mentioned above produce edit scripts to apply detection systems for web pages. In WebVigiL on the documents to generate the latest version the change detection is based on an expressive with changes. The requirement of the WebVigiL user specification. was to display the customized changes Delta XML [4] developed by Mosnell computed by the detection algorithms based on provides a plug-in solution for detecting and user input. The changes were at a finer displaying changes to contents between two granularity than a page. Hence the above versions of an XML document. They represent techniques could not be used directly for changes in a merged delta file [5] by adding resolving the issues for presentation of the detection graph constructed for sentinels S1, S2 detected changes in a legible format. and S3 are shown is Figure 2. The Current WebCQ [10] system supports change detection over various primitive S AND S3 change types such as links, images, keywords, 1 lists, and tables but there is no provision to give Sand composite change requests. Also, the current S2 Sand WebCQ system does not support the monitoring of multiple web pages with a single request. group 2 group group 1 Users cannot combine multiple URLs with different change types. It provides the interface Links to display the web page specified by the user, if Images the web page contains frames user have the flexibility to select the particular frame for the change detection, if the frame is not selected Figure 2: Change Detection Graph then the change will only be detected on the main webpage. The base page can be assumed The different types of nodes in the CDG are: to be present at level zero; all the pages pointed 1. URL node: A URL node is a leaf node that by the base page at level one and so on. If denotes the page being monitored. there are links on the frames WebCQ cannot 2. Change type node: All level 1 nodes are monitor changes on the web page said to be change type (links, keywords, corresponding to that links means monitoring the images, phrases) nodes. These nodes web pages in depth is not possible but in perform the change detection over the WebVigiL the user only specifies the URL and respective types. the changes are detected on all the frames if 3. Composite node: A Composite node exists of the webpage specified by the user. represents a combination of change types. Unlike WebCQ, WebVigiL is capable of All nodes above level 1 belong to this type. detecting changes to one level in depth of a web The URL nodes are notified regularly after page. fetching the required pages with the specified periodicity. To reduce redundant fetching, a page is fetched only if it’s last modified time 3.