SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS

by

PAUL SUGANTHAN G C (20084053)
MUTHUKUMAR V (20084041)
NANDHAKUMAR B (20084043)

A project report submitted to the FACULTY OF INFORMATION AND COMMUNICATION ENGINEERING

in partial fulfillment of the requirements for the award of the degree of BACHELOR OF ENGINEERING in COMPUTER SCIENCE AND ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY CHENNAI
CHENNAI - 600025
MAY 2012

CERTIFICATE

Certified that this project report titled "SEARCH ENGINE ENHANCEMENT BY EXTRACTING HIDDEN AJAX CONTENT IN WEB APPLICATIONS" is the bonafide work of PAUL SUGANTHAN G C (20084053), MUTHUKUMAR V (20084041) and NANDHAKUMAR B (20084043), who carried out the project work under my supervision, in fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science and Engineering. Certified further that, to the best of my knowledge, the work reported herein does not form part of any other thesis or dissertation on the basis of which a degree or an award was conferred on an earlier occasion on this or any other candidate.

Place: Chennai
Date:

Dr. V Vetriselvi
Project Guide, Designation,
Department of Computer Science and Engineering,
Anna University Chennai, Chennai - 600025

COUNTERSIGNED

Head of the Department,
Department of Computer Science and Engineering,
Anna University Chennai, Chennai - 600025

ACKNOWLEDGEMENTS

We express our deep gratitude to our guide, Dr. V VETRISELVI, for guiding us through every phase of the project. We appreciate her thoroughness, tolerance and ability to share her knowledge with us. We thank her for being easily approachable and quite thoughtful. Apart from adding her own input, she has encouraged us to think on our own and give form to our thoughts. We owe her for harnessing our potential and bringing out the best in us. Without her immense support through every step of the way, we could never have made it this far.

We are extremely grateful to Dr. K.S. EASWARAKUMAR, Head of the Department of Computer Science and Engineering, Anna University, Chennai 600025, for extending the facilities of the Department towards our project and for his unstinting support.

We express our thanks to the panel of reviewers, Dr. ARUL SIROMONEY, Dr. A.P. SHANTHI and Dr. MADHAN KARKY, for their valuable suggestions and critical reviews throughout the course of our project.

We thank our parents, family, and friends for bearing with us throughout the course of our project and for the opportunity they provided us in undergoing this course in such a prestigious institution.

Paul Suganthan G C
Muthukumar V
Nandhakumar B

ABSTRACT

Current search engines such as Google and Yahoo! are prevalent for searching the Web. Search on dynamic client-side Web pages is, however, either nonexistent or far from perfect, and not addressed by existing work. This is a real impediment, since AJAX and Rich Internet Applications are already very common on the Web. AJAX applications are composed of states which can be seen by the user, but not by the search engine, and changed by the user through client-side events. Current search engines either ignore AJAX applications or produce false negatives. The reason is that crawling client-side code is a difficult problem that cannot be solved naively by invoking user events.

This project aims to propose a solution for crawling and extracting hidden AJAX content, thus enabling search engines to enhance their search result quality by indexing dynamic AJAX content. Though AJAX content can be crawled manually in a browser by invoking client-side events, enhancing a search engine to crawl AJAX content automatically, as it does traditional web applications, has not been achieved.

This report describes the design and implementation of an AJAX Crawler, and then the enabling of a search engine to index the crawled states of an AJAX page. The performance of the AJAX Crawler is evaluated and compared with a traditional crawler. Possible issues regarding crawling AJAX content and future optimizations are also analysed.

ABSTRACT (TAMIL)

None of the existing search engines search the web pages on the Internet that contain frequently changing content. Because of this, much of the content on the Internet remains unknown to people. The aim of this project is to make this hidden content visible to search engines. Even major search engines such as Google and Yahoo! overlook much of this content. Through this project, much of the hidden content on the Internet will be discovered by search engines, and so the amount of hidden content on the Internet will decrease. This project thus improves the capability of search engines.

Contents

CERTIFICATE i

ACKNOWLEDGEMENTS ii

ABSTRACT (ENGLISH) iii

ABSTRACT (TAMIL) iv

LIST OF FIGURES viii

LIST OF TABLES ix

LIST OF ABBREVIATIONS x

1 INTRODUCTION 1
1.1 AJAX ...... 1
1.2 Crawler ...... 2
1.3 Problem Definition ...... 3
1.4 Scope of the Project ...... 4
1.5 Organisation of this Report ...... 4

2 RELATED WORK 5
2.1 Crawling AJAX ...... 5
2.2 Finite State Machine ...... 8
2.3 Google's AJAX Crawling Scheme ...... 8

3 REQUIREMENTS ANALYSIS 11
3.1 Functional Requirements ...... 11
3.2 Non-Functional Requirements ...... 12
3.2.1 User Interface ...... 12
3.2.2 Hardware Considerations ...... 12
3.2.3 Performance Characteristics ...... 12
3.2.4 Security Issues ...... 13

3.2.5 Safety Issues ...... 13
3.3 Constraints ...... 13
3.4 Assumptions ...... 14

4 SYSTEM DESIGN 15
4.1 System Architecture ...... 15
4.1.1 Architecture Diagram ...... 15
4.2 Module Descriptions ...... 17
4.2.1 Identification of Clickables ...... 17
4.2.2 Event Invocation ...... 19
4.2.3 State Machine representation of AJAX website ...... 19
4.2.3.1 Visualizing the State Machine ...... 21
4.2.4 Indexing ...... 22
4.2.5 Searching ...... 22
4.2.6 Reconstruction of state ...... 22
4.3 User Interface Design ...... 22
4.4 UseCase Model ...... 23
4.4.1 UseCase Diagram ...... 23
4.5 System Sequence Diagram ...... 24
4.5.1 Event Invocation ...... 24
4.5.2 Searching ...... 24
4.6 Data Flow Model ...... 25
4.6.1 Data Flow Diagram ...... 25

5 SYSTEM DEVELOPMENT 28
5.1 Implementation ...... 28
5.1.1 Tools Used ...... 28
5.1.2 Implementation Description ...... 28
5.1.2.1 Ajax Crawling Algorithm ...... 29
5.1.2.2 State Machine ...... 32
5.1.2.3 Indexing ...... 34
5.1.2.4 Searching ...... 35
5.1.2.5 Reconstruction of a particular state after crawling ...... 36

6 RESULTS AND DISCUSSION 37
6.1 Results ...... 37
6.2 Performance Evaluation ...... 39

6.2.1 Crawling Time ...... 39
6.2.1.1 Number of States Vs Crawling Time ...... 40
6.2.2 Clickable Selection Policy ...... 41
6.2.2.1 Number of AJAX Requests Vs Probable Clickables ...... 42
6.2.2.2 Probable Clickables Vs Detected Clickables ...... 43
6.2.3 Clickable Selection Ratio Vs Crawling Time ...... 44
6.3 Search Result Quality ...... 45
6.4 Observations ...... 49

7 CONCLUSIONS 50
7.1 Contributions ...... 50
7.2 Future Work ...... 50

A Snapshots 52
A.1 Search Interface ...... 52
A.2 Google Bot and AJAX Crawler ...... 54

B DOM 58
B.1 DOM - ...... 58
B.2 DOM Tree Representation ...... 58

References 60

List of Figures

1.1 Crawler Architecture ...... 3
2.1 AJAX Crawling Scheme ...... 9
2.2 Control Flow ...... 10
4.1 Architecture Diagram ...... 16
4.2 Visualizing State Machine ...... 21
4.3 UseCase Diagram ...... 23
4.4 Sequence Diagram - Event Invocation ...... 24
4.5 Sequence Diagram - Searching ...... 25
4.6 Level 0 Data Flow Diagram ...... 25
4.7 Level 1 Data Flow Diagram ...... 26
4.8 Level 1 Data Flow Diagram ...... 27
6.1 Number of States Vs Crawling Time (in minutes) ...... 40
6.2 Number of AJAX Requests Vs Probable Clickables ...... 42
6.3 Probable Clickables Vs Detected Clickables ...... 43
6.4 Clickable Selection Ratio Vs Crawling Time per State (in minutes) ...... 44
A.1 Interface I ...... 52
A.2 Interface II ...... 53
A.3 Fetched By Google Bot ...... 54
A.4 Fetched By Google Bot ...... 55
A.5 Fetched By AJAX Crawler ...... 56
A.6 Fetched By AJAX Crawler ...... 57
B.1 DOM Tree ...... 59

List of Tables

5.1 Tools Used ...... 28
6.1 Test Cases ...... 37
6.2 Experimental Results ...... 38
6.3 Crawling Time ...... 39
6.4 Clickable Selection Policy ...... 41

LIST OF ABBREVIATIONS

Acronym   What It Stands For
AJAX      Asynchronous JavaScript and XML
CSS       Cascading Style Sheets
DOM       Document Object Model
HTML      HyperText Markup Language
JS        JavaScript
JUNG      Java Universal Network/Graph Framework
URL       Uniform Resource Locator
XML       Extensible Markup Language

CHAPTER 1

INTRODUCTION

Web applications are more and more becoming a replacement for desktop applications. In this chapter we introduce the techniques that support this change, and we give an outline of this report. The first section presents AJAX, the major new technique and architectural change for web applications over the past years. Section 1.2 discusses and explains the operation of a crawler. Section 1.3 presents the research problems of this report. Section 1.4 presents the scope of the project. Section 1.5 discusses the organisation of this report.

1.1 AJAX

AJAX is an acronym for Asynchronous JavaScript and XML. AJAX is a technique whereby a website can update part of a page without refreshing the whole content. This saves bandwidth and provides for a more interactive user experience. In other words, changes that a user makes appear quicker on the screen, and the website seems to respond much faster. This improved responsiveness increases the interactivity of websites and makes the user experience much more enjoyable. It should be noted that AJAX is not a technology in its own right; rather, it is a technique that utilizes other technologies. AJAX is considered one of the core techniques behind Web 2.0 applications.

AJAX is a clever combination of using the client-side JavaScript engine [11] to update small parts of the Document Object Model (DOM) with information retrieved by asynchronous server communication. By using AJAX technology developers can create applications in which the page does not have to be re-rendered every time an interaction has taken place; only small subsets of the page need to be updated.
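The core idea (update one node and leave the rest of the page alone) can be illustrated with a small sketch; the `page` object and node ids below are hypothetical stand-ins for the browser DOM and a real XMLHttpRequest response.

```javascript
// Sketch of an AJAX-style partial update: only the addressed node of the
// page model changes; the rest of the "DOM" is left untouched.
// `page` stands in for document nodes; a real application would use
// XMLHttpRequest (or fetch) and document.getElementById instead.
function applyPartialUpdate(page, nodeId, freshHtml) {
  if (!(nodeId in page)) throw new Error('unknown node: ' + nodeId);
  return { ...page, [nodeId]: freshHtml };   // re-render one node only
}

const before = { header: '<h1>News</h1>', body: 'old story', footer: '(c) 2012' };
const after = applyPartialUpdate(before, 'body', 'fresh story from server');
// `after.body` holds the new content; header and footer are unchanged.
```

In a real page, the server round-trip happens asynchronously, so the rest of the page stays interactive while the one node is being refreshed.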

A common problem with AJAX applications is that they break the web browser's Back button. In a normal non-AJAX application, every webpage has a unique URL. Thus, a user can hit the Back button to return to the previous URL, which is the state the browser was in before the user's last action. This can be seen as a sort of Undo operation. However, with AJAX the URL of the webpage does not change every time the state of the page changes. Therefore a press of the Back button may bring the user to a state much further back than intended. Also, page bookmarking is dependent upon the URL of the page in question; therefore, pages created by AJAX will not be bookmarkable.

1.2 Crawler

A Web crawler [7] is a computer program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Figure 1.1 depicts the high-level architecture of a standard Web crawler.
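The crawl loop of a traditional crawler can be sketched as follows; the in-memory `web` map is a hypothetical stand-in for real HTTP fetching and link extraction.

```javascript
// Minimal traditional-crawler sketch: a frontier of URLs, a visited set,
// and a "fetch" step. Here `web` maps a URL to the URLs it links to,
// standing in for downloading a page and parsing its anchor tags.
function crawl(web, seedUrl) {
  const visited = new Set();
  const frontier = [seedUrl];
  while (frontier.length > 0) {
    const url = frontier.shift();            // FIFO: breadth-first order
    if (visited.has(url)) continue;
    visited.add(url);                        // "download and index" the page
    for (const link of web[url] || []) {
      if (!visited.has(link)) frontier.push(link);
    }
  }
  return visited;
}

const web = {
  'a.html': ['b.html', 'c.html'],
  'b.html': ['a.html'],
  'c.html': [],
};
const pages = crawl(web, 'a.html');          // visits a, b and c
```

Note that such a crawler only follows static hyperlinks; content reachable only by firing client-side events is invisible to it, which is the gap this project addresses.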

FIGURE 1.1: Crawler Architecture

1.3 Problem Definition

With the advent of Web 2.0, AJAX is being used widely to enhance interactivity and user experience, and standalone AJAX applications are also being developed; Gmail and Yahoo! Mail are classic examples of AJAX applications. Current crawlers ignore AJAX content as well as dynamic content added through client-side script. Thus most of this dynamic content remains hidden. We have considered two problems in our project.

1. Crawling AJAX Content in websites

2. Making the crawled AJAX content indexable and searchable

1.4 Scope of the Project

The project enables hidden dynamic content to be visible to search engines, so the hidden web can be explored to a great extent. The project describes the design and development of an AJAX Crawler and the building of an AJAX search engine to search through the crawled states. Finally, we evaluate the performance of the AJAX Crawler. However, the scope of the project is limited by the fact that crawling AJAX content is time consuming, since it requires executing JavaScript, unlike traditional crawling.

1.5 Organisation of this Report

This report is organized as follows. Chapter 2 discusses the related work done in this area. Chapter 3 describes the requirements analysis of the system. Chapter 4 elaborates on the design of the system. Chapter 5 details the development of the system. Chapter 6 describes the results obtained from our system and provides an analysis of the results. Finally, Chapter 7 summarizes the work we have completed and presents pointers for future work.

CHAPTER 2

RELATED WORK

2.1 Crawling AJAX

AJAX (Asynchronous JavaScript and XML) is one of the most promising techniques to have emerged in the web application development area over the past few years. Collapsing a traditional multi-page application into a single page increases the responsiveness and interactivity of the user experience. Users do not have to click and wait any more, and the page does not have to be re-rendered every time an interaction takes place; only small subsets of the page need to be updated [14]. With these new dynamic applications a new term has emerged, Web 2.0, which is used to mark the changes in web applications in facilitating communication, information sharing, interoperability, and collaboration. The term Web 2.0 is used to denote AJAX applications but is also, and more commonly, used to denote user-generated content.

The responsiveness brought by the AJAX technique makes it possible to operate applications inside a browser as if they were desktop applications. The web application market is becoming increasingly dominant, and there are operating systems designed around web applications, such as Chrome OS from Google and webOS from Palm. This shows the importance of web applications as a replacement for ordinary applications.


As described in Chapter 1, AJAX uses the client-side JavaScript engine [11] to update small parts of the Document Object Model (DOM) with information retrieved by asynchronous server communication, so that only small subsets of the page are updated after each interaction. Users therefore experience a very fast, responsive application inside the web browser. The application is available everywhere the user connects to the internet, and is accessible with every browser. This eliminates the main disadvantages of having to install a full-blown desktop application on a computer with a certain amount of computational capacity, and the trouble of sharing files with other people or locations. This also makes cloud computing interesting: cloud computing is the term used to describe the trend in the computing world of moving away from desktop applications to online services. Although web applications are not a new phenomenon, the use of AJAX techniques is. These new techniques also require a good quality of service, of which testing is an important aspect.

The AJAX technology does not include the property of having a unique URL representing each unique state in an application. Due to the lack of an externally reachable unique state (a state reached by URL), crawlers are not able to access the full content of an AJAX application without the use of a pre-programmed JavaScript engine [7]. This problem of not having a state reachable by URL occurs both when crawling and when testing an AJAX application [9].

The first major work on crawling AJAX was done by Duda [10], who suggested a way to crawl the dynamic comments pages on YouTube. They modelled an AJAX website as a state machine and developed the first AJAX crawling algorithm, which this project uses as a base. They also indexed the YouTube dynamic comments pages and made them searchable.
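The state-machine view of an AJAX site can be sketched as below; the toy `app` function is a hypothetical stand-in for a real browser firing events, and DOM states are compared as plain strings.

```javascript
// Sketch of crawling by inferring state changes: fire each candidate event,
// compare the resulting DOM with the current one, and record a new state
// (plus the edge that led to it) only when the DOM actually changed.
// `app(dom, event)` stands in for a browser firing e.g. an onClick event.
function inferStateMachine(app, indexDom, events) {
  const states = [indexDom];               // discovered DOM states
  const edges = [];                        // { from, event, to }
  const queue = [0];                       // state indices left to explore
  while (queue.length > 0) {
    const s = queue.shift();
    for (const ev of events) {
      const dom = app(states[s], ev);
      if (dom === states[s]) continue;     // event changed nothing
      let t = states.indexOf(dom);         // duplicate-state check
      if (t === -1) {
        t = states.push(dom) - 1;          // genuinely new state: explore it
        queue.push(t);
      }
      edges.push({ from: s, event: ev, to: t });
    }
  }
  return { states, edges };
}

// Toy application: three DOM states reachable through three clickables.
const app = (dom, ev) =>
  ({ 'home|news': 'news', 'home|about': 'about',
     'news|home': 'home', 'about|home': 'home' }[dom + '|' + ev] || dom);
const machine = inferStateMachine(app, 'home', ['home', 'news', 'about']);
// machine.states -> ['home', 'news', 'about']; machine.edges has 4 edges.
```

Storing the element/event combination on each edge is what lets the crawler later replay a trace and bring the application back to any discovered state.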

To circumvent this problem, Mesbah et al. [15] proposed to crawl AJAX applications by inferring user interface state changes. Their technique centres around a state machine which stores the actions a user executes on a web page inside a real browser, starting from the root (the index state) and following traces down to the final state of a certain path. These states are discovered by searching the current DOM tree for elements on which events can be fired. These events include, for example, onClick, onMouseOver or onMouseOut, and firing the events on all the candidate elements may result in new states. The result of an event, the DOM tree, is compared with the DOM tree from before the execution. If the DOM tree has changed, a new state of the application is added to the state machine, linked to its predecessor state. The edge between the two states represents the element and event combination that produces the new state from the previous one. By storing the combination of an element and an event, the crawler is able to repeat the flow of actions that results in a given state. Using this information it is possible to bring an AJAX application to a given state, which makes the AJAX application state-aware through an external indexing shell.

2.2 Finite State Machine

Finite state machines are used to describe the behaviour of a system by recording the transitions from one state to another state. This method is mostly used in verifying software systems or software protocols [5].

The state machine used inside the crawler is not a completely specified state machine, but an incompletely specified one. A completely specified state machine is a state machine where every transition results in a unique new state [16]. When examining an AJAX application it is possible for multiple transitions to result in the same state, e.g., two different links can lead to the same page. This observation leads to the fact that the state machine used is an incompletely specified state machine. The minimal version of a completely specified state machine can be found in polynomial time [13], whereas minimising an incompletely specified state machine is proved to be NP-complete [12]. This means that no algorithm is known that minimises an incompletely specified state machine in polynomial time.

2.3 Google’s AJAX Crawling Scheme

Google proposed its own scheme for crawling AJAX [1]. AJAX websites which conform to this scheme will be crawled by the Google Bot. The AJAX crawling scheme proposes to mark the addresses of all the pages that load AJAX content with specific characters. The whole idea behind it is to use special hash fragments (#!) in the URLs of those pages to indicate that they load AJAX content. When Google finds a link that points to an AJAX URL, for example http://example.com/page?query#!state, it automatically interprets (escapes) it as http://example.com/page?query&_escaped_fragment_=state.

FIGURE 2.1: AJAX Crawling Scheme

The programmer is forced to change his/her website architecture in order to handle the above requests: when Google sends a web request for the escaped URL, the server must be able to return the same HTML code as the one presented to the user when the AJAX function is called.

After Google sees the AJAX URL and interprets (escapes) it, it grabs the content of the page and indexes it. Finally, when the indexed page is presented in the search results, Google shows the original AJAX URL to the user instead of the escaped one. As a result, the programmer should be able to handle users' requests and present the appropriate content when the page loads.
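The URL mapping used by the scheme can be sketched as a small function. This is a simplified reading of the scheme covering only #! URLs; the full specification also handles pages without hash fragments via a meta tag.

```javascript
// Sketch of Google's #! to _escaped_fragment_ mapping (simplified).
//   http://example.com/page?query#!state
//     -> http://example.com/page?query&_escaped_fragment_=state
function toEscapedFragment(url) {
  const i = url.indexOf('#!');
  if (i === -1) return url;                    // not an AJAX-crawlable URL
  const base = url.slice(0, i);
  const state = url.slice(i + 2);
  // If the URL already has a query string, append with '&'; else start one.
  const sep = base.includes('?') ? '&' : '?';
  return base + sep + '_escaped_fragment_=' + encodeURIComponent(state);
}
```

The server then answers a request for the escaped URL with the HTML snapshot of the corresponding AJAX state.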

FIGURE 2.2: Control Flow

The implementation of Google's AJAX crawling scheme imposes some constraints on developers. Also, a site having only a small amount of AJAX content is unlikely to be restructured to this scheme just for the purpose of crawling. This report therefore proposes a way to crawl AJAX sites by constructing the state machine of the site, which does not impose any constraints on developers. We view every site as an AJAX site and start crawling by invoking events. If there is any change in the DOM, we record it in the state machine. Once the state machine of a URL is generated, we then index the states to enable searching.

CHAPTER 3

REQUIREMENTS ANALYSIS

In this chapter, we provide an overview of the requirements and the functionalities of the system.

3.1 Functional Requirements

The project aims to crawl AJAX content in web applications and make it searchable. The abstract modular view of the project is given by the following steps.

1. Identification of Clickables

2. Invocation of events

3. Representing AJAX website as State Machine

4. Indexing the crawled states

5. Searching through the indexed content

6. Reconstruction of a particular state in browser

3.2 Non-Functional Requirements

3.2.1 User Interface

A user interface is provided for searching. The user interface is developed using HTML and PHP. The user enters the query in a text box and performs the search. For the browser-driven UI, any standard web browser is required.

3.2.2 Hardware Considerations

The project requires a computer with the Windows operating system. The system used in our experiments has a 320 GB hard disk drive, 2 GB RAM and a 1.2 GHz processor.

3.2.3 Performance Characteristics

As performance forms an important parameter of this project, there are a number of performance considerations. AJAX crawling is compared with traditional crawling on the following factors:

1. Crawling Time

2. Search Result Quality

3. Clickable Selection Policy

3.2.4 Security Issues

As the project is fully software based, there are no security issues concerning this project.

3.2.5 Safety Issues

There are no particular safety issues concerning this project.

3.3 Constraints

• JavaScript execution: A crawler capable of crawling AJAX requires the capability to execute JavaScript.

• Duplicate State: Multiple events may lead to the same state. Thus we need to avoid adding duplicate states.

• Infinite State Change: If the same events can be invoked indefinitely on the same state, the application model can explode.

• Numerous ways of adding event handlers: A JavaScript event can be added to a particular HTML element in many ways. Thus events assigned in all these various ways should be handled properly.

3.4 Assumptions

• No Forms: The AJAX Crawler doesn't handle forms, because handling forms is complex: it requires appropriate test data for submission, and if a captcha is present it is not possible to submit the form at all. A form may also consist of different types of input elements such as checkboxes, select boxes, radio buttons, etc. Deep web crawling by handling forms is itself a separate research problem.

• Limiting the number of states: The Crawler limits the number of states to prevent state explosion.

• Only Click Event: The Crawler invokes only the click event on HTML elements during crawling. The elements which can be clicked are termed clickables.

• Only Text-based Retrieval: The Crawler handles only text-based changes. Image-based changes, as in Google Maps, are not considered.

CHAPTER 4

SYSTEM DESIGN

In this chapter, we describe the design issues considered in the software development process.

4.1 System Architecture

4.1.1 Architecture Diagram

Figure 4.1 depicts the architecture diagram of the entire system. The set of modules, along with the control flow between them is depicted.


FIGURE 4.1: Architecture Diagram

4.2 Module Descriptions

4.2.1 Identification of Clickables

Identification of clickables is the first phase of an AJAX Crawler. It involves identifying the clickables that would modify the current DOM. The main issue is that a click event may be attached to an HTML element in many ways. Several ways to add an event listener are shown below:

• test.onclick = test_function;

• test.addEventListener('click', test_function, false);

• Using jQuery: $('#test').click(function() { test_function(); });

All the above methods perform the same function of adding an onclick event to the element test.

Thus clickables cannot be identified in a standard way, because numerous JavaScript libraries exist and each has its own way of defining event handlers. So the approach of clicking all potentially clickable elements is followed. The list of clickable HTML elements is shown below.

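The click-everything policy above can be sketched as a simple tag filter; the candidate-tag set below is illustrative (tags typically treated as clickable in such crawlers), not the exact list used by the project, and the element objects are hypothetical stand-ins for a DOM traversal.

```javascript
// Sketch of the "click every potentially clickable element" policy:
// since handlers may be attached in many ways (onclick property,
// addEventListener, jQuery, ...), no registration-based detection is
// attempted. Instead, every element whose tag is in a candidate set is
// treated as a probable clickable. The tag set here is an assumption.
const CANDIDATE_TAGS = new Set(['a', 'button', 'input', 'div', 'span', 'img', 'li', 'td']);

// `elements` stands in for a DOM traversal: [{ tag, id }, ...]
function probableClickables(elements) {
  return elements.filter(el => CANDIDATE_TAGS.has(el.tag.toLowerCase()));
}

const clickables = probableClickables([
  { tag: 'A', id: 'nav1' },        // kept: anchor
  { tag: 'script', id: 's1' },     // dropped: never clickable
  { tag: 'div', id: 'tab2' },      // kept: may carry an onclick handler
]);
```

Each probable clickable is then actually clicked during crawling; only those whose click changes the DOM are recorded as detected clickables.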