Managing the Web at its birthplace

Prepared for the South African Annual Conference on Web applications (1999-09-09).
To be published in the South African Journal of Information Management.

Maria P. Dimou - CERN
Dariusz J. Kogut (for the TORCH section)

Abstract

CERN, the European Laboratory for Particle Physics, is the place where the Web was born. It has the longest tradition in Web-related development; its main role, however, is to do basic physics research. This ambiguous combination of interests makes the task of managing the Web at CERN most challenging. The large number of Web servers to consolidate, the users' high expectations of performance and of centrally supported tools, and the expression of individual creativity within corporate image standards create a non-trivial Intranet operation. In this paper the configuration of the CERN central servers will be discussed. The choice of commercial and public-domain software for the server, search engine, Web discussion forums and log statistics will be presented. The guidelines and recommendations on Web authoring and site management will be listed, as well as the criteria that led us to these conclusions.

Introduction

Managing a very large Intranet, and incidentally the oldest one, may seem at first glance to be a straightforward technical issue, but it is more than that. It touches upon very interesting human and social aspects, such as:

• who can publish
• what can be published
• how to promote important information
• what is important information
• who is responsible for the quality of the information
• what is creative and what is bad taste
• how independent the data structures are from the content
• how uniform the Intranet appears

Our Intranet is far from being the most attractive and concise one; the following statements, however, are all true and explain or justify the situation:

• Our "lack of discipline" also has its advantages, as it led to achievements like the World Wide Web.
• The main objective of the CERN authors and web server managers is to do High Energy Physics research, so they never focus on the aesthetics of their sites.
• Physicists are used to building entire projects by themselves and to solving problems across disciplines (physics, engineering, computing); they therefore hate conforming to restrictions on corporate image that are accepted without problem in business environments.

Experience has shown that trying to over-centralize and control the presentation, location, platform or number of the CERN web pages, sites and servers is a utopia, and in any case contrary to the design and philosophy of the Web, the "universe of network-accessible information" (Tim Berners-Lee). What we found realistic and valuable was to offer solid and open solutions that would prove performant and simple to adopt or comply with. These solutions were made available on the central web server www.cern.ch for those who have no resources to set up a service of their own. They were also documented as tools, courses and guidelines for those who want to stay independent and run their own service, with the possibility of getting advice. This work covered the period 1997-98 and is summarized here. As people and strategies change, there is no commitment that the policy (or the URLs!) of this paper will remain valid in the distant future.

The choice of web server

CERN was, of course, using the home-made CERN httpd from the Web's birth until June 1998, when the central server www.cern.ch converted to Apache (some other servers on-site had done so earlier).
The performance degradation observed during the last year of use of the CERN httpd was striking, and it was very hard for us to identify the reason for the blockage. We upgraded the hardware, doubled the CPUs, added memory, increased the AFS cache and monitored the impact of the CGIs running on the host, only to witness in practice that the CERN httpd simply does not scale beyond a certain number of hits (~100K requests/day, or on average 4 requests/sec). This was suggested by Jes Sorensen (CERN) and proved to be true after we installed Apache and settled on the following values for the httpd.conf tuning parameters:

• MaxKeepAliveRequests 100
• MaxSpareServers 40
• MaxClients 250
• MaxRequestsPerChild 1000
• MinSpareServers 20

Apache proved to be a very powerful Web server. What we appreciated most in Apache were the virtual Web servers, the Server Side Includes and the handling of page protections. The virtual Web servers allowed us to host those web sites that no longer had the resources to maintain their own system but wanted to keep their identity (URLs) unchanged. The Server Side Includes helped us create and maintain a uniform look on our service pages; whenever site-wide changes were needed, the update had to be made in only one place.

Most information owners want to know who visits their pages and ask to be able to access the Apache logs and extract the information that concerns them. We chose the analog package for this, making available to the users a list of fields and tailorable parameters from which they can select and produce statistics reports for the period, the criteria and the format that suit them.

Another preoccupation of every system manager running a web server is the potential security risk of CGI scripts. To ensure some safety we decided to keep all scripts in one place. Their authors are not allowed to move them there themselves but must submit them to a programmer of the web support team, who checks them for potential security holes. Some documentation was recently made available on typical errors in Perl involving shell commands and file I/O.

(Co*)authoring

(*) "L'enfer, c'est les autres" ("Hell is other people"), J.-P. Sartre

The challenges in this area depend:

• on the site's size (one page, a few pages, hundreds of pages?)
• on the documents' history (newly created or legacy papers)
• on the documents' "mission" (to be printed or to be linked)
• on the documents' complexity (images, dynamic content)
• on the number of (co-)authors
• on their working platform
• on their preferred authoring tool (the package used to edit pages and manage web sites)

The above do not concern page or site style, fonts and content, but only version consistency, document portability and write permissions. We started by making an inventory of the tools available on the market for page editing, site management and image processing. It was immediately obvious that keeping such an inventory up to date, evaluating all of the packages or supporting more than a handful of them is a never-ending task. Therefore, we evaluated a subset of these products against a list of criteria and concluded that:

• legacy documents that traditionally existed on paper and should continue to be printable with a rigid format and page numbers should be published as PDF or PostScript.
• users who have a very small number of pages to write should be left free to use their preferred editing tool and to save their pages in a web-viewable format (e.g. HTML or PDF).
• Unix users are usually happy to write plain HTML and should be left alone until their sites become relatively large (over 50 pages).
• PC and Mac users enjoy the availability of a large number of editing and management tools, and should choose the most standard and open ones, i.e. those creating portable code across platforms and server software.

The products we actually examined more carefully were:

• Macromedia Dreamweaver (for Windows and Mac)
• GoLive CyberStudio (excellent, but Mac-only at the time)
• MS FrontPage 98 (very good if one stays within Microsoft products)
• NetObjects Fusion (good for very large sites of several hundred pages; expensive)
• Adobe PageMill (limited functionality)

Their page-editing facilities are very similar, i.e. link insertion and list, template and table creation are all quite easy. The worries start when proprietary add-ons reduce the portability of the end product. We needed to come up with a recommendation in a finite amount of time and suggested Macromedia Dreamweaver for medium-sized sites, not for its perfection (there are bugs!) but for its open, straightforward, W3C-standards-compliant features. Some of them (the list is neither exhaustive nor exclusive) are:

• HTML hiding
• Clean HTML production
• Easy link, table, image, font and metadata insertion/change
• Easy form creation
• Templates (local or remote) and Library elements
• Page result preview in the user's preferred browsers
• Invocation of a preferred external editor
• Link validity checking
• Easy site definition and uploading to the server
• Easy maintenance of any site on a remote Unix or Windows system
• Automatically built site map
• Safe editing by more than one author
• Help with scripting, JavaScript and DHTML
• Server Side Includes (SSI) support (also locally)
• Insertion of common plug-ins
• Cascading Style Sheets (CSS) support
• XML support / parser

From the page-usability point of view, one of the most serious problems we have, due to the large number of authors (over 1,000) and the lack of coordination, is the quality of the page content. Pages are written and forgotten, authors leave the organization, and users do not know whom to contact to find out what is still valid. Some of our collaborators in the laboratory's 20 Member States, or in the rest of the world, have modest network connectivity and cannot use sites heavily loaded with images and animations. For these reasons we issued guidelines to authors explaining that every page should have:

• a signature for the readers' information
• a mail address for feedback
• a date when the page might expire or need review
• some concern for users with slow lines (no overloading of pages with pictures, etc.)
• the appropriate metadata to promote the page to a good rank in search results
• a robots.txt file at the level of the site's document root that prevents search engines from unauthorised indexing
• good content in the TITLE tag, corresponding to the page's mission
• ALT attributes on IMG elements, providing a text description of each image (vital for interoperability with speech-based and text-only user agents).
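As a purely illustrative sketch (the title, e-mail address, dates, file names and paths below are invented placeholders, not taken from an actual CERN page), a minimal page following these guidelines might look as follows:

  <HTML>
  <HEAD>
    <TITLE>Web Authoring Guidelines</TITLE>
    <!-- metadata that helps search engines describe and rank the page -->
    <META NAME="description" CONTENT="Guidelines for authors publishing pages on the Intranet">
    <META NAME="keywords" CONTENT="web, authoring, guidelines">
  </HEAD>
  <BODY>
    <H1>Web Authoring Guidelines</H1>
    <P>Page content, kept deliberately light for readers on slow lines.</P>
    <!-- the ALT text keeps the image usable by speech-based and text-only agents -->
    <IMG SRC="logo.gif" ALT="Laboratory logo">
    <HR>
    <!-- signature, feedback address and review date -->
    <ADDRESS>
      Maintained by <A HREF="mailto:First.Last@cern.ch">First.Last@cern.ch</A><BR>
      Last updated: 1999-09-01; to be reviewed by: 2000-09-01
    </ADDRESS>
  </BODY>
  </HTML>

and, at the site's document root, a robots.txt file along these lines (the disallowed path is again only an example):

  # ask compliant search-engine robots not to index this area
  User-agent: *
  Disallow: /internal/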