Brass: Web-Based Queuing System for Warrick

Master’s Project Final Report

Author: Amine Benjelloun Email: [email protected] Modified by Frank McCown

Project Advisor: Dr. Michael L. Nelson Email: [email protected]

Project Presentation Date: Apr 20th, 2007

Department of Computer Science Old Dominion University

Table of Contents
Brass: Web-Based Queuing System for Warrick
1- Abstract:
2- Introduction:
  2-1 Background about Warrick:
  2-2 Need for the Project:
3- Project Details:
  3-1 Project Architecture:
  3-3 Database:
4- Installation Guide
  4-1 Installing Tomcat
  4-2 Installing Warrick
  4-3 Installing Brass
  4-4 Compiling
  4-5 Adding a new node
5- Appendix: Source Code Description

1- Abstract:

Warrick is a tool built as part of a research project by Frank McCown, a Ph.D. student at Old Dominion University. It reconstructs lost websites when backups are not available by searching for and retrieving web resources from web repositories: web archives such as the Internet Archive and search engine caches such as those of Google, MSN, and Yahoo. Using Warrick requires downloading it, installing it, and reading its instructions, which can be difficult for the average non-technical user. My project, Brass, is a web interface to Warrick that allows an individual who has lost their website to recover it by filling out a simple form. A single server could easily manage the expected number of queries submitted by Warrick during the reconstruction process, but the web repositories limit the daily number of queries coming from a specific IP address. In addition, Google used to provide keys that are needed to query its system and that restrict the number of queries to 1,000 per key per day. Google no longer provides these keys but continues to support the existing ones, which have become a scarce resource. Given these limitations, and in order to maximize the number of requests and website recoveries that can be performed daily, Brass uses multiple servers and provides automated management, which includes: maintaining a queue of jobs, launching Warrick with the appropriate parameters, launching the next job in the queue on the next free machine, and notifying users when their websites become available. It also includes an administrator’s interface to manage the system and perform these tasks manually.

2- Introduction:

2-1 Background about Warrick: From http://warrick.cs.odu.edu/warrick.html:

Warrick is a utility for reconstructing or recovering a website when a back-up is not available. Warrick will search the Internet Archive, Google, MSN, and Yahoo for stored pages and images and will save them to your file system. Warrick is most effective at finding cached content in search engines in the first several days after losing the website since the cached versions of pages tend to disappear once the search engine re-crawls your site and can no longer find the pages. Running Warrick multiple times over a period of several days or weeks can increase the number of recovered files because the caches fluctuate daily (especially Yahoo's). Internet Archive's repository is at least 6-12 months out of date, and therefore you will only find content from them if your website has been around at least that long. If they don't have your website archived, you might want to run Warrick again in 6-12 months. (2)

Please consult the following for a more complete description of Warrick and Brass: Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). June 23, 2007. Vancouver, British Columbia, Canada.

2-2 Need for the Project:
The main points that show the need for this project are the following:
• Using Warrick requires an individual to download it to their system, install it, and read its instructions before they can use it. While these tasks are easy and straightforward for a technical user, they can be difficult for the average non-technical user.
• Web repositories limit the daily number of queries that can be performed on their systems from a specific IP address. The following table shows the actual numbers allowed by each web repository:

Web repository      Requests per 24 hours
Internet Archive    1000
Google              1000
Yahoo               5000
MSN                 10,000

• Google used to provide keys that are necessary for Warrick to query its system. Google has stopped providing these keys but still supports the existing ones, so these keys have become a scarce resource that needs to be managed efficiently.

For these reasons, there is a need for a project that will, in simple terms, establish and manage a queuing system for Warrick, provide an easy way for an individual to submit a request to recover their website, recover the lost website, and finally let them know when their website is available so they can download it to their system.

3- Project Details:

3-1 Project Architecture:

Figure 1: Brass Architecture

Figure 1 shows the overall architectural design of this project. The left side of the figure details the interactions between Brass and an individual who is trying to recover their website. They first submit a request through a simple HTML form (see Figure 2), giving their name, email address, and the URL of their lost website. Once they have submitted the request, they immediately receive an email message asking them to confirm it (confirmation is required before the request can be processed) and providing two links: the first confirms the request, and the second checks its status. An individual may receive a number of reminders to confirm their request while it is pending. They can check the status of their request at any time and get real-time information about the recovery process: the number of URLs that have been processed, the number of URLs that have been recovered so far, the number of URLs still to be processed, and an estimated percentage of the overall progress. When the recovery is complete, the individual receives a separate email with a link to download a compressed file of their whole website. They may receive further email reminders to pick up their website if they have not done so.

Figure 2: New request form

Brass requires a shared file system with Tomcat installed. All the classes used in this project are shown in the middle of Figure 1; the classes in green are the ones accessed by both administrators and individuals who lost their websites (for example, ConfirmRequest, ViewDetail, and NewRequest). Once the Tomcat server is started on a machine, the Main class is a servlet that listens for all incoming requests. Administrators have to authenticate themselves before they can access an interface to perform the necessary tasks manually (more details in section 3-5). The Main class takes its initial settings from a configuration file, web.xml (more details in section 3-4). Brass uses two files as a database: Jobs.xml and History.xml (more details in section 3-3). Many machines can be deployed. The next section shows how the machines communicate with each other.

3-2 Communication among machines:

Figure 3: Communication among machines

Figure 3 gives some details about the communication among the deployed machines. There is one master machine (in this example, Andromeda) and a number of slave or worker machines (such as Cash, Tango, Ra, and Isis). Virtually any number of machines can be deployed, and the master machine can process jobs as well. The master machine runs its CleanUp routine, which gets a list of free machines (machines that are not running any instance of Warrick) and the same number of jobs from the top of the queue. The master machine then dispatches these jobs to the free machines. Figure 4 shows the actual command invoked on a slave machine (Cash in this case) to start its AssignJob routine.

Figure 4: Calling AssignJob on a slave machine
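The dispatch step described above (taking as many jobs from the top of the queue as there are free machines) can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the Brass source: the class name DispatchSketch, the method name, and the representation of machines and jobs as plain strings are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the dispatch step; not taken from the Brass source. */
final class DispatchSketch {

    /**
     * Pairs each free machine with the next job key from the top of the queue.
     * Pairing stops as soon as either the machines or the jobs run out.
     * The returned map goes from machine name to job key, in dispatch order.
     */
    static Map<String, String> pairJobsWithFreeMachines(List<String> freeMachines,
                                                        List<String> queuedJobKeys) {
        Map<String, String> assignments = new LinkedHashMap<>();
        int count = Math.min(freeMachines.size(), queuedJobKeys.size());
        for (int i = 0; i < count; i++) {
            assignments.put(freeMachines.get(i), queuedJobKeys.get(i));
        }
        return assignments;
    }
}
```

With two free machines and three queued jobs, the sketch assigns the first two jobs and leaves the third in the queue for the next CleanUp run.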

Once AssignJob is called on a machine, it creates a launcher script (see Figure 5). It gets the Warrick command line from the database file and adds it to the script, then builds the HTTP request that calls the CleanUp class, appends it to the script, and finally executes the script. The script therefore runs Warrick with the appropriate options, waits until Warrick finishes, and then calls the master machine’s CleanUp routine, which in turn looks for more jobs to dispatch (among many other tasks).

Figure 5: Call CleanUp on master machine
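A rough Java sketch of how such a launcher script could be assembled is shown below. The report does not reproduce the script text, so several details are assumptions: the use of /bin/sh, the use of wget to issue the HTTP request, and the class and method names are all hypothetical. The CleanUp URL is passed in as a parameter because its exact form is not given here.

```java
/** Hypothetical sketch of launcher-script generation; not taken from the Brass source. */
final class LauncherScriptSketch {

    /**
     * Builds a shell script that runs Warrick in the foreground and, once it
     * has exited, requests the CleanUp URL on the master machine.
     * Assumption: wget is available on the worker; the URL contains no single quotes.
     */
    static String buildScript(String warrickCommand, String cleanUpUrl) {
        StringBuilder script = new StringBuilder();
        script.append("#!/bin/sh\n");
        // Warrick runs in the foreground, so the next line only executes after it finishes.
        script.append(warrickCommand).append('\n');
        // Notify the master so it can dispatch further jobs.
        script.append("wget -q -O /dev/null '").append(cleanUpUrl).append("'\n");
        return script.toString();
    }
}
```

Because the two commands run sequentially, no explicit wait is needed: the shell does not reach the notification line until the Warrick process has ended.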

3-3 Database:
Brass uses two files as a database:
• Jobs.xml: this file consists of two parts. The first part keeps track of the names of the deployed machines as well as their status (busy running an instance of Warrick, or free). The second part holds all the details about the submitted jobs, such as the name, email, lost URL, and submission date.
• History.xml: this is a secondary database that holds only the details of jobs that have been completed, whether or not they have been picked up by the individuals who submitted them.

Figure 6: Database File
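To make the role of Jobs.xml more concrete, the following Java sketch updates a job entry with the standard DOM API, changing its “status” from “pending” to “inQueue” and setting its “confirmDate” (the update the appendix attributes to ConfirmRequest). The tag names “status” and “confirmDate” come from the appendix; the element names "job" and "key", the overall layout of the file, and all class and method names are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical sketch; the element names "job" and "key" are assumptions. */
final class JobsXmlSketch {

    /** Parses an XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Moves a pending job into the queue; returns false if no pending job has this key. */
    static boolean confirm(Document jobs, String key, String confirmDate) {
        NodeList jobNodes = jobs.getElementsByTagName("job");
        for (int i = 0; i < jobNodes.getLength(); i++) {
            Element job = (Element) jobNodes.item(i);
            if (key.equals(childText(job, "key")) && "pending".equals(childText(job, "status"))) {
                setChildText(job, "status", "inQueue");
                setChildText(job, "confirmDate", confirmDate);
                return true;
            }
        }
        return false;
    }

    /** Returns the text of the first descendant with the given tag, or null if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    /** Sets the text of the first descendant with the given tag, creating it if absent. */
    private static void setChildText(Element parent, String tag, String value) {
        NodeList nodes = parent.getElementsByTagName(tag);
        if (nodes.getLength() > 0) {
            nodes.item(0).setTextContent(value);
        } else {
            Element child = parent.getOwnerDocument().createElement(tag);
            child.setTextContent(value);
            parent.appendChild(child);
        }
    }
}
```

A real implementation would also write the modified document back to disk and would do so while holding the lock described in section 3-5.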

3-4 Configuration File:
This file contains parameters that are likely to change frequently over time. Their values can be modified here without having to search the code for every place where they are used and recompile it. Some of these parameters are: the location of the HTML files, the location of the classes, the path to Warrick’s executable file, and the administrator’s email address.

3-5 Administrator Interface:
Figure 7 shows the administrator’s interface and the tasks that an administrator can perform manually. Using the database file, all the jobs are grouped into four separate tables depending on their status (pending, inQueue, processing, or complete). Every table provides a list of applicable actions that an administrator can perform.

An administrator can deploy or un-deploy a machine by using one of the two buttons provided; this adds that machine to, or removes it from, the list of machines used by Brass. The administrator needs to make sure, though, that Tomcat is started on that machine once it has been deployed. They can start or suspend a job on a machine, and they can move a job within the queue of jobs waiting to be processed.

On the same page an administrator can see how long a job has been in its current status. This is shown in the age column, which turns red after a number of days preset in the configuration file. They can also see the real-time progress of all the jobs that are processing. This progress is obtained by reading the log file generated by Warrick and extracting the necessary values from it: the number of URLs that have been processed, the number recovered so far, the number still in the queue waiting to be processed, and an estimated percentage of the overall progress. By clicking on a specific job, they get all the details about that job.

The green bar represents an estimated percentage of the overall progress of the job. This percentage is obtained by the following equation:

Estimated % = # processed URLs / (# processed URLs + # remaining URLs)

The page can be set to refresh itself automatically (On option). If it is set to “On”, the page reloads every 60 seconds. That value can be changed in the address bar to any number of seconds.
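The progress estimate above can be written as a small Java function. This is a minimal sketch with hypothetical names; it multiplies the ratio by 100 to express it as a percentage and returns 0 when no URLs have been counted yet, a guard that the report does not mention and that is added here only to avoid a division by zero.

```java
/** Hypothetical sketch of the progress estimate; not taken from the Brass source. */
final class ProgressSketch {

    /** Returns the estimated completion percentage, between 0 and 100. */
    static double estimatedPercent(long processedUrls, long remainingUrls) {
        long total = processedUrls + remainingUrls;
        if (total == 0) {
            return 0.0; // nothing known yet; avoid dividing by zero
        }
        return 100.0 * processedUrls / total;
    }
}
```

Because Warrick discovers new URLs while it runs, the number of remaining URLs can grow, so an estimate of this form may decrease between two refreshes.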
Brass uses a locking mechanism to ensure that the database is not modified by more than one application at a time. The lock is simply a text file, “Lock.txt”, that is created before an application gets access to the database and deleted after the application has finished making all the necessary modifications. If the lock is still held by a previous application, the current application simply does not make any modifications to the database. The two options in the top right corner, “Release” and “Get”, can be used to manage the lock manually if needed. In a typical scenario, an administrator who wants to stop or pause the system from going through the queue can acquire the lock manually (Get option). In another case, if a machine dies before releasing the lock, the system will not process any jobs beyond the ones currently processing, and an administrator can release the lock manually (Release option).

The CleanUp button can be used to start the system for the first time, after restarting the Tomcat server on some machines, or after deploying one or more machines. Once the CleanUp routine is called, it starts as many new jobs as there are free machines. The administrator can also start a job manually on a machine. This starts only that specific job, even if there are other free machines. Once that job finishes, the CleanUp routine is called automatically to start as many jobs as possible. The CleanUp routine also sends any reminder or notification emails that are due, updates the status of jobs that have completed, and prepares a compressed file of each recovered website.
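A lock file of this kind can be sketched in Java as follows. The sketch uses java.nio.file.Files.createFile, which fails if the file already exists, so checking for the lock and taking it happen in one step. Whether Brass creates its lock this way is not stated in this report; the class and method names are hypothetical, and the atomicity guarantee may be weaker on some shared (network) file systems.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of a file-based lock; not taken from the Brass source. */
final class LockFileSketch {

    /** Tries to take the lock; returns false if the lock file already exists. */
    static boolean tryGet(Path lockFile) {
        try {
            Files.createFile(lockFile); // fails if the file is already there
            return true;
        } catch (FileAlreadyExistsException alreadyHeld) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock; returns false if no lock file was present. */
    static boolean release(Path lockFile) {
        try {
            return Files.deleteIfExists(lockFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller that receives false from tryGet would skip its database modifications, which matches the behavior described above. The manual “Get” and “Release” options correspond to calling the two methods directly.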

Figure 7: Administrator’s Interface

4- Installation Guide

These are the steps to follow in order to make Brass work.

4-1 Installing Tomcat
• Follow the instructions to install and configure Tomcat: http://tomcat.apache.org/
• Create the following directories:
    TOMCAT_HOME/webapps> mkdir myapp
    TOMCAT_HOME/webapps> mkdir myapp/WEB-INF
    TOMCAT_HOME/webapps> mkdir myapp/WEB-INF/classes
• If needed, change the default port number (8080) in the following file: TOMCAT_HOME/conf/server.xml
• Replace by by in the following file: TOMCAT_HOME/conf/context.xml
• Add the following lines to your .cshrc file:
    setenv TOMCAT_BASE
    set TOMCAT_HOME = ($TOMCAT_BASE $JAVA_HOME $CLASSPATH)
    setenv CLASSPATH "${CLASSPATH}:$TOMCAT_BASE/server/lib/catalina.jar"
• “Recompile” your .cshrc file:
    /home/~username> source .cshrc

4-2 Installing Warrick
• Follow the instructions to install Warrick: http://www.cs.odu.edu/~fmccown/research/lazy/warrick.html#downloading
• Google keys must be placed in the google_key.txt file along with the machine name (separated by tabs) as follows:
    key1    machine1
    key2    machine2
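Before starting the system it can be useful to check that google_key.txt follows the layout above. The Java sketch below parses lines of the form key, tab, machine name into a map from machine to key and rejects malformed lines. It is not part of Warrick or Brass; the class name, the method name, and the decision to skip blank lines are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical checker for the google_key.txt layout; not part of Warrick or Brass. */
final class GoogleKeyFileSketch {

    /** Maps machine name to key, given lines of the form "key<TAB>machine". */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> keyByMachine = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: blank lines are harmless and skipped
            }
            String[] fields = trimmed.split("\t+");
            if (fields.length != 2) {
                throw new IllegalArgumentException("Expected key<TAB>machine: " + line);
            }
            keyByMachine.put(fields[1].trim(), fields[0].trim());
        }
        return keyByMachine;
    }
}
```

The lines could be read with java.nio.file.Files.readAllLines and passed to parse; a line that uses spaces instead of a tab would be reported as malformed.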

4-3 Installing Brass
• Download Brass.tar.gz to some public directory: http://www.cs.odu.edu/~abenjell/Brass_TarBall/Brass.tar.gz
    Example: /home/~username/public_html/
• Untar Brass.tar.gz
    Example: /home/~username/public_html> tar xvfz Brass.tar.gz
• Create a directory that will hold the recovered files
    Example: /home/~username/public_html> mkdir recons
• Set access permissions on the reconstruction directory
    Example: /home/~username/public_html> chmod 701 recons
• Make sure the access permissions for all the files and directories are set as follows:
    /home/~username/public_html> chmod 701 ./brass
    /home/~username/public_html> chmod 744 ./brass/*
    /home/~username/public_html> chmod 701 ./brass/Admin
    /home/~username/public_html> chmod 744 ./brass/Admin/*
    /home/~username/public_html> chmod 704 ./brass/Admin/.*
• Default user names and passwords for the brass/Admin directory: admin (kalima) and admin2 (kalima2). User names and passwords can be modified or added by editing the following files:
    o /brass_home/Admin/.htaccess (user names)
    o /brass_home/Admin/.htpassbrass (hash of passwords)
  Many tools are available to create the hash of a password. Example: http://www.kxs.net/support/htaccess_pw.html
• Move the directory /brass/Brass to the Tomcat servlets directory:
    /home/~username/public_html> mv ./brass/Brass /webapps/myapp/WEB-INF/classes/
• Move /brass/Admin/web.xml to: /Tomcat_home/webapps/myapp/WEB-INF/
• Open web.xml. It has some pre-defined values that need to be modified to reflect the corresponding values of:
    o JobsFilePath: the path to Jobs.xml (database)
      Example: /brass_home/Admin/Jobs.xml
    o HistoryFilePath: the path to History.xml (for old records)
      Example: /brass_home/Admin/History.xml
    o BaseReconDir: the directory where the recovered files are stored
      Example: /home/~username/public_html/recons/
    o PathToWarrick: the path to the Warrick executable file
      Example: /home/~username/Warrick/warrick.pl
    o BaseHtmlDir: the path to the base directory of all the HTML files
      Example: /home/~username/public_html/Brass/
    o machineName: the name of the main machine (master machine)
      Example: cash.cs.odu.edu or c24.seven.research.odu.edu
    o classFilesPath: the path to the Java classes (starting with the port number)
      Example: :6022/myapp/servlet/
    o BasePickUpDir: the directory where the compressed recovered websites are stored. It should point to the same directory as “BaseReconDir”.
      Example: http://www.cs.odu.edu/~username/recons/
• Modify the name of the master machine and the path to the Main class with their actual values in the following files:
    /brass_home/Admin/index.html
    /brass_home/Admin/new.html
    /brass_home/new.html
    /brass_home/confirm.html
    /brass_home/status.html
• Start Tomcat on the master machine and on the machines that will be deployed
    Example: TOMCAT_HOME/bin/startup.sh
• Finally, go to the administrator’s main page:
    Example: http://www.cs.odu.edu/~username/brass/Admin
• If you modify one or more classes and need to rebuild the project:
    javac *.java

4-4 Compiling
To compile Brass, the CLASSPATH must be set to include the Tomcat jar files:
    setenv CATALINA_HOME '/home/fmccown/apache-tomcat-6.0.10'
    setenv CLASSPATH "${CLASSPATH}:$CATALINA_HOME/lib/catalina.jar:$CATALINA_HOME/bin/tomcat-juli.jar"

Then Main.java can be compiled, which will pick up any changes to the other Java files:
    cd TOMCAT_HOME/webapps/warrick/WEB-INF/classes/Brass
    javac Brass/Main.java

4-5 Adding a new node
To add a new node, Tomcat must first be installed on the node along with Brass. If Brass was installed on a shared file system, then no additional work is needed. Tomcat must be started before the node is deployed:
    TOMCAT_HOME/bin/startup.sh

Now the node can be added using the Administrative Screen.

5- Appendix: Source Code Description

Main servlet:
Main.java: This is the only servlet in the system. The calls to all the classes go through the Main class with a flag set to the corresponding value. Depending on that flag, Main performs the necessary actions, such as preparing the corresponding parameters and calling the corresponding class.

New Jobs:
NewRequest.java: This class takes values from a new job request (for example: name, email, URL) and adds a new entry to the database with all the initial values.
LostKey.java: Individuals who have lost the unique identifier for their job can go to an HTML page and provide their email address to have their key sent back to them. The LostKey class takes an email address, goes through the database, and for every corresponding entry sends an email containing the key.

Actions performed on a job:
ConfirmRequest.java: Given a key (unique identifier), this class looks for the corresponding entry in the database, changes the “status” tag from “pending” to “inQueue”, and sets the “confirmDate” to the current date.
DeleteRequest.java: Given the key identifier, this class looks up the corresponding job and deletes it from the database.
ChangeQueue.java: This class takes a key and an action. According to the desired action, the corresponding job is moved within the queue: all the way to the top of the queue, all the way to the bottom, one level up, or one level down.
GetNodeToAssign.java: This class takes a node and a key, and calls the AssignJob class on the corresponding machine. If no specific machine is given, it looks for the first free machine and starts the job on it.
GetNodeToSuspend.java: This class also takes a node and a key, and calls the SuspendJob class on the corresponding machine.
AssignJob.java: Given a key, this class goes through the database and gets the corresponding Warrick command line. It creates a launcher script that runs that command line, waits until Warrick finishes executing, and then calls the CleanUp class. It then updates some tags, such as the “assignDate”, the “node”, and the “processID”, in the database file.
SuspendJob.java: This class also takes a key. By going through the database, it gets the corresponding processID, kills that process, and makes the appropriate changes to the database.
CleanUp.java: This class performs many tasks. It is called with many parameters from the configuration file. It goes through all the jobs in the database file. For the jobs that are pending (waiting to be confirmed), it sends more email reminders if needed and deletes any jobs that have not been confirmed for more than a set number of days. For the jobs that are in the queue, it takes as many jobs (from the top of the queue) as there are free machines and starts those jobs on those machines. For every job that is processing, the CleanUp class accesses the corresponding log file generated by Warrick and checks whether Warrick has finished executing, in which case it makes the necessary changes to the job details, compresses the recovered files, and sends an email to the individual with instructions to download their website. It also checks whether the log file has not changed for a set number of hours, in which case it suspends the job, puts it at the top of the queue, and sends a summary email of what happened to the administrator. For every job that has a status of “complete”, CleanUp sends more pickup reminder emails if needed. If the job has been picked up, it keeps it for a set number of days and then deletes it. If the job has not been picked up for a set number of days, it deletes it. In either case, CleanUp deletes the corresponding entry from the database (Jobs.xml) and adds a record to the secondary database (History.xml).
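The four queue movements handled by ChangeQueue (top, bottom, one level up, one level down) can be sketched on an in-memory list of job keys. This is an illustration only: the real class operates on Jobs.xml, and the class, enum, and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of queue reordering; not taken from the Brass source. */
final class QueueMoveSketch {

    enum Move { TOP, BOTTOM, UP, DOWN }

    /**
     * Returns a copy of the queue with the given job moved.
     * Index 0 is the top of the queue. Unknown keys leave the order unchanged,
     * and moving past either end keeps the job at that end.
     */
    static List<String> move(List<String> queue, String key, Move move) {
        List<String> result = new ArrayList<>(queue);
        int from = result.indexOf(key);
        if (from < 0) {
            return result;
        }
        int to;
        switch (move) {
            case TOP:    to = 0; break;
            case BOTTOM: to = result.size() - 1; break;
            case UP:     to = Math.max(0, from - 1); break;
            default:     to = Math.min(result.size() - 1, from + 1); break; // DOWN
        }
        result.remove(from);   // remove by index
        result.add(to, key);   // reinsert at the target position
        return result;
    }
}
```

Returning a new list rather than modifying the argument keeps the sketch free of side effects; an implementation backed by the XML file would instead reorder the job entries and save the file.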

Different views of the jobs:
ViewJobs.java: When called, this class takes a status (pending, inQueue, processing, or complete) and gets all the jobs that currently have that status. It displays them in a table and provides a set of appropriate actions that an administrator can perform (move a job in the queue, or suspend, start, or delete a job).
ViewAll.java: This class simply calls ViewJobs.java once for each of the four statuses. It is used in the administrator’s main interface to show all the jobs grouped by their status, with all the tasks that can be performed on them.
ViewDetails.java: This class can be called by an administrator as well as by a regular user. It takes a key and, depending on who is calling it, displays some specific details about the corresponding job (for a regular user) or all of the details (for an administrator). It also displays details about any other jobs related to the same email address.

Other classes:
SendMail.java: This class is called by NewRequest, LostKey, and CleanUp. It takes a “from” address, a “to” address, a “subject”, and an email body, and sends the corresponding email.
UpdateNodes.java: Given a machine name and an action (deploy or undeploy), this class adds the machine to, or removes it from, the list of deployed machines.
ReleaseGetLock.java: This class is called with one of two options: Get or Release. The Get option creates a text file used as a locking mechanism (Lock.txt), and the Release option deletes that file.
PickUp.java: This class takes a key identifier and prepares a link for the individual to download their recovered website. It also updates some details about the job: it sets the “pickedUp” tag to “yes” and the “pickUpDate” to the current date.

Jobs.xml Tags definition:
: The name of a machine being deployed (example: isis, tango).
: The state of the machine. It could be “free” (not running any instance of Warrick) or “busy” (running at least one instance of Warrick).
: The 32-character unique identifier of every job.
: The first name of the individual submitting the job.
: The last name of the individual submitting the job.
: The email of the individual submitting the job.
: The URL of the lost website.
: The command line for Warrick with the appropriate options.
: The date and time of the submission of the request.
: The status of the job. It could be “pending”, “inQueue”, “processing”, or “complete”.
: The number of confirmation reminders sent to an individual.
: The date and time of the last confirmation reminder sent to an individual.
: The date and time of the confirmation of the request.
: The date and time of the last time the job was assigned to a machine.
: The name of the machine that is processing the current job (example: isis.cs.odu.edu or c24.seven.research.odu.edu).
: The process ID of the Warrick instance that is processing the current job.
: The date and time of the completion of the job.
: The total number of URLs that have been processed during the recovery process.
: The total number of URLs that have been recovered during the recovery process.
: The number of pickup reminders sent to an individual.
: The date and time of the last pickup reminder sent to an individual.
: The date and time when the current job was picked up.
: Shows whether the current job has been picked up or not. It could have the value “Yes” or “No”.
