Dynamically Generated Web Content: Research and Technology Practices

Stavros Papastavrou *, George Samaras *, Paraskevas Evripidou *, Panos K. Chrysanthis **
* University of Cyprus
** University of Pittsburgh

INTRODUCTION

Personalization and customization of Web services that increase user satisfaction require the delivery of dynamic rather than static Web documents or pages. This means that the content of such Web pages is generated on demand and tailored to a particular Web user (e.g., e-banking) or group of users (e.g., delivery of local on-line sport results). Specifically, the term dynamically generated Web content, otherwise known as dynamic Web content or simply dynamic content, refers to chunks of HTML/XML code or media that are generated and combined on the fly to build a requested Web page.

Currently, dynamic content constitutes more than 40% of Internet traffic, despite the fact that generating Web pages on the fly incurs a major overhead on server resources and increases the response time of Web servers (Feldmann, 1999). This percentage of Internet traffic associated with dynamic content is expected to keep increasing, especially with the improvement of dynamic content technologies and content middlewares such as application servers, client-side proxies and server-side caches.

In this chapter, we take a tutorial approach to present the Web-related technologies and content middlewares that attempt to accelerate the generation and optimize the delivery of dynamic content. We begin the chapter by covering the historical aspects of dynamic content and presenting the reasoning behind its introduction, while discussing early content middlewares such as CGI and FastCGI that enable the execution of external programs and scripts. We then present the evolution of content middlewares along the lines of conducted research. Our discussion focuses on popular techniques that mostly include content caching and content fragmentation at different levels along the communication path from the Web client, through any intermediate proxies, to the Web server and back-end database servers. We also discuss a variety of other research efforts such as hardware and low-level acceleration techniques, active caching, and delta encoding. We conclude the chapter with a discussion on the interplay between Quality of Service (such as response time or user-perceived latency) and Quality of Data (such as freshness). We consider various proposals that attempt to strike a balance between user-perceived latency and freshness of delivered documents, which can be broadly classified as client-driven and data-driven. Finally, motivated by growing Web user needs, we discuss future trends of dynamic content technology.

HISTORICAL ASPECTS

In order to better understand the semantics of dynamic content, we first review the basics of static content. Static content emerged along with the introduction of the World Wide Web (WWW, or simply Web) back in the early 90's. The Web is a vast network of servers linked together by a common protocol called the Hypertext Transfer Protocol (HTTP), allowing access to various forms of content uniquely identified by global addresses called Uniform Resource Locators, or URLs. Static content refers to pre-existing files with extensions such as .html and .jpg that denote their content and purpose. HTML files contain tags that define the logical and visual structure of a Web document; text content is laid out within those tags. HTML files may also contain references (hyperlinks/URLs) to other documents or links to graphic and multimedia content such as image and audio files.

Web users or clients request a static document by pointing their browsers to it, either indirectly via hyperlinks found inside other documents, or directly by explicitly specifying the location and name of the document. Those requests are better known as 'HTTP GET requests' and cause the Web server on the other side to issue an 'HTTP response' that is immediately followed by the requested static file. The control text of both HTTP requests and responses is hidden from the user. Once a browser downloads a static document, it parses and renders it according to the HTML tags found inside it. References to static graphic content, of the form <img src=image1.jpg>, trigger a new HTTP request without any extra effort by the user.
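As a concrete illustration of this request/response exchange, the short Python sketch below issues an HTTP GET request for a static document and inspects the response. The host name and document path are placeholders, not from the chapter.

import http.client

# A minimal sketch of the HTTP GET exchange for a static document.
conn = http.client.HTTPConnection("www.example.com")
conn.request("GET", "/index.html")        # the 'HTTP GET request'

resp = conn.getresponse()                 # the 'HTTP response'
print(resp.status, resp.reason)           # e.g., 200 OK
print(resp.getheader("Content-Type"))     # e.g., text/html

html = resp.read()                        # the static file itself
conn.close()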
Later on, the need to embed basic dynamic information into Web documents, such as the current time at the server side or the date the document was last modified, motivated the Server Side Includes (SSI) as a first step toward dynamic Web content. HTML documents written according to SSI included tags of the form <!--#echo var="DATE_LOCAL" --> and carried the file extension *.shtml instead of the traditional *.html. An HTTP request for a document with that file extension triggered the Web server to behave differently: instead of submitting the actual contents of the document (as stored on the file system), the Web server parses the document and substitutes the SSI tags with the corresponding dynamic information. In addition to tag substitution, SSI introduced another breakthrough for Web development by allowing an HTML document to include another one with a tag of the form <!--#include file="top_menu.html" -->, pointing the way toward modular Web development.

The quick adoption of the Web, first by the academic community and then by the business community, led to a whole new class of Web-based applications. The key element in the popularity and success of this new class of Web applications was the ability to serve Web users with customizable content. The key requirement was that an HTTP request for a document, say http://www.stocks.com/mystocks.cgi, produce different content for each Web user under certain situations that we discuss next.

The evolution came along with the introduction of the Common Gateway Interface, or CGI. According to CGI, a Web server was able to execute programs and return their output to Web users. In this case, the Web server acted more like a mediator or gateway between the Web users and the executable programs, rather than being a simple file dispenser. Nonetheless, the big question was: how could a CGI program distinguish between different users?

A Web user causes a CGI program, for example mystocks.cgi, to execute by issuing an HTTP GET request for it, which triggers the gateway functionality on the Web server. The latter then forks a new process to execute that file. However, in order to make that call meaningful, extra client-related parameters must be provided to the CGI program, such as the ID of the user and the stock symbol of interest. One way to achieve this is by appending a list of variable names and values after the file name of the CGI program. For example, for a user with ID=123 requesting a quote for Intel, the HTTP GET request would target the URL www.stocks.com/mystocks.cgi?id=123&stock=Intel.

The given variables and values in this case are called URL parameters and are passed as input to the CGI program by the Web server. The CGI program queries a local (or remote) application database to retrieve data concerning the user with id=123 and his or her preferences for the stock Intel. It then dynamically constructs HTML code enriched with the retrieved data. Finally, the complete result is sent to the Web user via the Web server (our gateway).

Another popular way of passing parameters to a CGI program is by submitting the contents of HTML forms to the Web server. Similar to our previous example, the Web user fills in the ID and the stock symbol in two text boxes inside an HTML form. Upon submission of the form, the parameters are not placed after the file name of the CGI program as URL parameters. Instead, they are appended at the end of the HTTP request, which in this case is called an HTTP POST request. Both HTTP-based request approaches, HTTP GET, which uses URL parameters, and HTTP POST, which uses HTML form input, are still very popular mechanisms for modern Web applications to distinguish between individual users and situations. In the rest of this chapter, we will refer back to these mechanisms while describing more recent content middlewares.
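To ground the discussion, here is a minimal sketch of what a CGI program like mystocks.cgi might look like, written in Python. The database file and table layout are illustrative assumptions, not part of the chapter. The sketch handles both request styles: for a GET request the Web server hands the URL parameters to the process via the QUERY_STRING environment variable, while for a POST request the form input arrives on standard input.

#!/usr/bin/env python3
# A sketch of a CGI program such as mystocks.cgi.
# The database file "stocks.db" and table "portfolio" are illustrative.
import os
import sqlite3
import sys
from urllib.parse import parse_qs

# URL parameters (GET) arrive via QUERY_STRING; form input (POST) via stdin.
if os.environ.get("REQUEST_METHOD") == "POST":
    length = int(os.environ.get("CONTENT_LENGTH", 0))
    params = parse_qs(sys.stdin.read(length))
else:
    params = parse_qs(os.environ.get("QUERY_STRING", ""))

user_id = params.get("id", [""])[0]
stock = params.get("stock", [""])[0]

# Under CGI, a fresh database connection must be opened on every request.
conn = sqlite3.connect("stocks.db")
row = conn.execute(
    "SELECT shares FROM portfolio WHERE user_id = ? AND symbol = ?",
    (user_id, stock),
).fetchone()
conn.close()

# The response: an HTTP header block, a blank line, then HTML built on the fly.
print("Content-Type: text/html")
print()
print(f"<html><body><p>User {user_id}: "
      f"{row[0] if row else 'no'} shares of {stock}.</p></body></html>")

Note that the database connection is opened and torn down on every single request; this per-request overhead is exactly what motivates FastCGI, discussed next.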
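From the client's side, the difference between the two request styles can be seen in a few lines of Python. The URL is the chapter's hypothetical stocks example and is not expected to resolve; the snippet only illustrates where the parameters travel.

from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({"id": "123", "stock": "Intel"})

# HTTP GET: the parameters travel inside the URL itself.
with urlopen("http://www.stocks.com/mystocks.cgi?" + params) as resp:
    page = resp.read()

# HTTP POST: the same parameters travel in the request body,
# as a browser would send them when an HTML form is submitted.
with urlopen("http://www.stocks.com/mystocks.cgi", data=params.encode()) as resp:
    page = resp.read()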
FROM CGI TO FASTCGI

CGI suffered from one major drawback that surfaced soon after its introduction. Every client HTTP request for a dynamic page required the Web server to fork a new middleware process. When serving multiple concurrent requests, this approach led to decreased performance, since the server system spent more time context switching, forking and terminating processes than actually serving the client requests (a phenomenon better known in the literature as thrashing).

Besides scalability problems, the non-persistent nature of CGI processes introduced a number of other inconveniences. The most important, in our belief, is the inability to keep open (persistent) database connections to the application database(s). Persistent database connections accelerate the processing of client requests, since they are immediately available to the middleware processes for submitting a series of queries. As we discuss later on, it is a common practice for modern content middlewares to assign an open database connection to each running process. Another major shortcoming of CGI is that consecutive requests from the same client are each handled by a different middleware process. This prevents the use of a single network connection between a client and a middleware process for the complete duration of a client session.

To address the shortcomings of CGI, FastCGI was introduced. Its revolutionary design allowed middleware processes to be persistent and to execute in a different context than that of the Web server.
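To make the contrast with CGI concrete, the sketch below mimics the FastCGI persistence model in Python. It does not speak the actual FastCGI wire protocol; it uses Python's built-in WSGI reference server purely to illustrate a long-lived middleware process that holds one open database connection across many requests. The database file and table continue the earlier illustrative example.

import sqlite3
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

# Opened once, when the middleware process starts, and reused thereafter.
conn = sqlite3.connect("stocks.db")

def app(environ, start_response):
    params = parse_qs(environ.get("QUERY_STRING", ""))
    user_id = params.get("id", [""])[0]
    stock = params.get("stock", [""])[0]
    row = conn.execute(
        "SELECT shares FROM portfolio WHERE user_id = ? AND symbol = ?",
        (user_id, stock),
    ).fetchone()
    body = f"<p>{stock}: {row[0] if row else 'no'} shares</p>".encode()
    start_response("200 OK", [("Content-Type", "text/html")])
    return [body]

# The process stays alive between requests: the essence of the FastCGI model.
make_server("", 8000, app).serve_forever()

Real FastCGI deployments exchange a binary protocol with the Web server over a socket, but the performance argument is the same as in this sketch: the process startup, forking, and database reconnection costs are paid once at startup rather than on every request.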