The Seven Pillars of Operational Wisdom: Selected Topics in Blackboard System Administration

Abstract

The Blackboard Learning Management System presents many challenges for the system administrator due to its complexity. Some of these challenges are common to production web applications, while many others result from the design and implementation of Blackboard itself. Seven key areas of system administration are explored in terms of both theory and practice, and illustrative problems, root causes, methods of diagnosis, and solutions (if any) are discussed. The areas are version control, configuration management, monitoring, automated recovery, restore and backup, security, and “operational hygiene” – other Blackboard-specific operational issues. The tools and techniques described are for a load-balanced Enterprise system running on , but much of the underlying theory and practice applies to all platforms.

Preamble, Caveats, and Intent

This work is based on a few assumptions about the audience's goals and understanding of system administration in general and the design of robust, scalable web systems in particular. First, institutionally we are constrained by budget and knowledge – we cannot be expected to have enterprise Oracle or Java web application design and analysis experience, and we cannot afford to throw money (hardware, consultants, diagnostic tools, certifications) at problems to solve them. Our customers and management have high expectations for reliability, and as administrators we will do everything reasonable to meet those expectations, including remote and on-site administration at all hours. Our goal is to maintain a cost-effective and reliable learning management service that minimizes the consumption of our limited time, budget, and patience.

I. Revision Control

The notions of observability and repeatability are just as important in system administration as they are in the physical sciences. Similarly, consistency of manufacture is as important when producing a physical product as when deploying a new server. In software development, a revision control system or repository is used to maintain an audit trail or changelog of source code as well as a means of retrieving past versions of the code. “Code” typically refers to textual data but can refer to binary objects as well, such as images, audio files, etc. In terms of Blackboard administration, a revision control system provides us with the means to store, annotate, and retrieve configuration files. New Blackboard systems are usually deployed after an iterative process of testing and reconfiguration, with the final tested configuration placed into production. In a horizontally-scaled (load-balanced) architecture, it's vital to deploy an identical known good configuration to all hosts in the web/application tier. Further, some problems may not be found immediately, so it's important to be able to revert changes back to a known good configuration. We can achieve both goals of deploying identical configurations and reverting changes back to a known good state by keeping configuration files in a revision control system. Another benefit of using a revision control system is that it gently formalizes the change process. Rather than logging into a production server and directly editing a configuration file, a working copy is checked out from the repository to the local host, edited, and checked in or committed back to the repository with an annotation explaining the reason for the change. The new version is then checked out on the production servers. If there are any problems, the changes can be reverted by checking out the previous approved version of the file.
The file on the production servers can be compared to what's stored in the repository to show differences, the changelog will show the history of changes to the file, and a semblance of observable and repeatable change management will have been achieved. The administrative overhead of deploying a revision control system is small, considering the two most common revision control systems are well-documented, free, and supported in most major text editors and integrated development environments (IDEs) on all major platforms. Setting up a repository on a server is fairly straightforward, simplified by the availability of high-quality documentation and binary executables for all major server platforms. We have found the largest cost has been in time spent training staff to adapt to the new workflow. As useful as a revision control system is, it is only the first step in managing server configuration. There are many other aspects of server configuration that are too complex to manage with only a revision control system.
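The compare-production-to-repository step lends itself to a small sketch. The config fragment below is hypothetical, and the standard library's difflib stands in for the revision control system's own diff command; in practice one would simply run the repository tool's diff against the deployed file:

```python
# Sketch: show differences between the repository copy of a config file
# and the version actually deployed on a production server.
# The two fragments are invented examples, not real Blackboard settings.
import difflib

repo_copy = "MaxThreads 150\nKeepAlive On\n"   # checked-out working copy
deployed  = "MaxThreads 400\nKeepAlive On\n"   # file on the production host

diff = list(difflib.unified_diff(
    repo_copy.splitlines(), deployed.splitlines(),
    fromfile="repository", tofile="production", lineterm=""))
print("\n".join(diff))
```

Any non-empty diff here signals an unapproved local change: the file on the server no longer matches the annotated, approved version in the repository.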

II. Configuration Management

Configuration management (CM) is the practice of positively controlling aspects of a host's local configuration. Configuration resources that can be managed include:

• configuration files

• users and groups

• file permissions and ownership

• mounts and mount points

• software (installation, removal, version)

• network interfaces

• daemons and services (running and disabled)

Server configuration is initially set at build time and is kept under control over the server's lifecycle by a CM system installed at server build time. Initial configuration is controlled by using an automated server build system such as Jumpstart (Solaris) or Kickstart (Red Hat Linux, CentOS) or, alternately, by using a disk imaging system such as Norton Ghost, g4l, or g4u. Disk imaging systems are not recommended since they require manual partitioning of disks and are not as robust as automated build systems. Unattended system builds can be performed by using a DHCP server on an isolated build network, PXE (network) boot, a software repository (often an installation CD copied into the document root of a webserver), and a build configuration file.

# Kickstart file automatically generated by anaconda.
install
url --url http://repository.stedwards.edu/centos/4.4/os/i386/
lang en_US.UTF-8
langsupport --default=en_US.UTF-8 en_US.UTF-8
keyboard us
xconfig --card "ATI Mach64 3D Rage IIC" --videoram 4096 --hsync 30-81 --vsync 56-76 --resolution 800x600 --depth 16
network --device eth0 --bootproto dhcp
rootpw --iscrypted $1$.qwertyuiopasdfghjklzxcvbnm1234
firewall --disabled
selinux --disabled
authconfig --enableshadow --enablemd5
timezone --utc America/Chicago
bootloader --location=mbr
# The following is the partition information you requested
# Note that any partitions you deleted are not expressed
# here so unless you clear all partitions first, this is
# not guaranteed to work
clearpart --all --drives=sda
part /boot --fstype ext3 --size=100 --ondisk=sda
part pv.10 --size=0 --grow --ondisk=sda
volgroup VolGroup00 --pesize=32768 pv.10
logvol swap --fstype swap --name=LogVol01 --vgname=VolGroup00 --size=1000 --grow --maxsize=2000
logvol / --fstype ext3 --name=LogVol00 --vgname=VolGroup00 --size=1024 --grow

%packages
@ dialup
@ mail-server
-dovecot
grub
-spamassassin
kernel-smp
e2fsprogs
lvm2
# Add nice things to have
lynx
zsh
# Remove unnecessary bloat on server
-bluez-bluefw
-bluez-hcidump
-bluez-libs
-bluez-utils
-finger
-irda-utils
-isdn4k-utils
-mysql
-NetworkManager
-pcmcia-cs
-up2date
-wireless-tools

%post

/sbin/chkconfig cups off

/bin/rpm -i http://maxie.it.stedwards.edu/depot/aide/aide-0.11-1.rf.i386.rpm
/bin/rpm -i http://maxie.it.stedwards.edu/depot/ruby/ruby-1.8.6-3.i386.rpm
/bin/rpm -i http://maxie.it.stedwards.edu/depot/puppet/facter-1.3.7-1.noarch.rpm
/bin/rpm -i http://maxie.it.stedwards.edu/depot/puppet/puppet-0.22.3-1.noarch.rpm
# update /etc/sysconfig/puppet with proper puppet server

Table 1: Example kickstart.cfg file

Alternately, the same process could be used to create a standard image of a virtual machine to be run under VMware, Xen, or an equivalent system. What matters is that known good initial configurations can be repeatably produced. Once one has defined a standard, minimal operating system installation including a configuration management system client, it is possible to use the CM system to customize a generic server to its specific role and keep its configuration under positive control. There are many free, production-quality CM systems available for systems (bcfg2, cfengine, lcfg, puppet). In our case, we deployed Puppet on unmanaged servers and gradually normalized the configurations. This allowed us to gracefully introduce configuration management and see immediate benefits as we learned more about Puppet. By storing Puppet manifests (configurations to be applied to a target system) in a revision control system, we were able to deploy and revert configurations without logging into the target systems and were assured that configurations would be kept consistent. Any local changes made to a resource managed by Puppet would be overwritten by the canonical configuration in a few minutes, and a log of the change would be made.

node 'zim.it.stedwards.edu' {
    include aide
    include apcupsd
    include base
    include blackboard_web_tidy
    include dependency_crond
    include dependency_dns
    include dependency_ldap
    include dependency_ntpd
    include dependency_sendmail
    include dependency_sshd
    include diagnostics
    include ntp
    include mail_aliases
    include munin_node
    include puppet_client
    include resolver
    include sendmail_cf
    include service_monit
    include shells
    include software_depot_client
    include sudo
    include syslog
    include users_blackboard
    include users_compserv
    include users_itec
    include users_monit
    include users_munin
    include utility_philesight
}

Table 2: Puppet node description for Blackboard web/application server

class sudo {
    file { "sudoers":
        path   => "/etc/sudoers",
        owner  => root,
        group  => root,
        mode   => 440,
        source => "puppet://repository.stedwards.edu/dist/apps/sudo/sudoers",
    }
}

Table 3: Puppet manifest for managing /etc/sudoers file

class users_blackboard {
    # create bbuser group
    group { "bbuser":
        gid      => 502,
        ensure   => present,
        provider => groupadd
    }

    # create bbuser account
    # NB: Standardize on bbuser uid
    user { "bbuser":
        comment    => "Blackboard Role Account",
        home       => "/home/bbuser",
        uid        => 502,
        gid        => bbuser,
        groups     => [ "itec", "root" ],
        shell      => "/bin/bash",
        allowdupe  => false,
        ensure     => present,
        membership => minimum,
        provider   => useradd
    }

    # create /home/bbuser
    file { "/home/bbuser":
        path   => "/home/bbuser",
        owner  => bbuser,
        group  => bbuser,
        mode   => "0700",
        ensure => directory
    }

}

Table 4: Puppet manifest for creating Blackboard user and group (bbuser.bbuser)

class blackboard_web {
    # create bbsupport account

    # create /usr/local/blackboard
    # NB: Set permissions and ownership
    file { "/usr/local/blackboard":
        path   => "/usr/local/blackboard",
        owner  => "bbuser",
        group  => "root",
        ensure => directory
    }

    # Symlink /home/bbuser/blackboard -> /usr/local/blackboard
    file { "/home/bbuser/blackboard":
        ensure => "/usr/local/blackboard"
    }

    # create NFS mountpoint
    file { "/mnt/nfs/blackboard":
        path   => "/mnt/nfs/blackboard",
        ensure => directory
    }

    # mount NFS partition
    mount { "/mnt/nfs/blackboard":
        device  => "dib.it.stedwards.edu:/mnt/san/blackboard",
        fstype  => nfs,
        options => "rw,tcp,bg,intr,soft,nfsvers=3",
        dump    => 0,
        pass    => 0,
        ensure  => mounted,
    }

    package { "compat-db": ensure => installed }

    # NB: Need the following:
    #   repo for software
    #   repo for configuration
    #   repo for tools (snapshot, course archive, active courses)

    # Ultimately:
    #   shared Blackboard application image
    #   refactored (local) log directory
    #   refactored (local) configuration directory
}

Table 5: Puppet manifest for production Blackboard web service

There are several drawbacks to this approach. There's a learning curve associated with choosing, deploying, and maintaining a CM system, as well as the time and effort spent on documentation and training. It's possible to push a bad configuration to multiple machines, so extra care is needed when making changes; however, as long as the CM client is running on the affected hosts, it's also possible to quickly recover from simple errors. There's a temptation to bypass the CM system during unplanned outages and other crises, and confusion can result if operators do not understand which resources are under CM and expect their manual changes to be retained in production. These drawbacks are small compared to the benefits produced by the CM system. Production servers can be built very rapidly since most of the user intervention cost has been paid once, when setting up the automated build and CM systems. Both systems act as a form of documentation by formalizing the build process and by enumerating the resources managed on each host. Similarities and differences between hosts can be examined, allowing configurations to become gradually normalized. Errors in configuration can be detected and resolved when a standard configuration is pushed to all hosts. Standardizing builds and centralizing configuration reduces the need for backups on production servers. Note that the Blackboard installer requires user intervention, meaning that Blackboard is almost impossible to deploy or upgrade in an unattended, automatic fashion. Also, as of version 6.3 the Blackboard installer is a monolithic executable – it's effectively impossible to upgrade individual elements of the Blackboard infrastructure, or to use the operating system's native packaging system to verify the integrity of installed files, perform a trial “dry-run” installation, verify that dependencies are installed, or revert back to a previous version.
Configuration management systems are usually designed to install software using an operating system's native packaging system, so Blackboard's nonstandard packaging seriously interferes with the operational best practice of maintaining a known system state with a configuration management system. There are ways around this (e.g. installing Blackboard to apply database changes, then deploying the unpacked application files via a CM system or a shared read-only volume, or repackaging Blackboard), but again, this administrative burden could be avoided if Blackboard were distributed as an OS-specific package rather than as an unmanageable interactive executable. Once we have simplified and centralized the control over our systems and put them into production, we want to know if, when, and how they break, preferably before our users and our managers do.

III. Monitoring, Trending, and Alerting

Monitoring is the art of periodically and programmatically determining the state of a system, resource, or process. Monitoring is related to two other critical activities – trending and alerting – and provides data for both. Monitoring can be performed internally or externally, and the difference between the two can be blurred by the use of SNMP and remote loghosts which expose internal information. Using the example of a web server, external monitoring could examine the timing and content of the response to an HTTP request; internal monitoring could examine the number and age of httpd processes, accesses and errors logged, resource utilization such as CPU time, core and virtual memory used, rate and persistence of network connections, etc. Pragmatically, the difference between internal and external monitoring is whether a local monitoring agent is run on the monitored host. In this context, syslogd and snmpd may be viewed as monitoring agents; analogous systems exist for Windows servers. Another helpful view of monitoring is whether the state tests are spot checks (component-level) or end-to-end (functionality) tests. As an example, consider the case where the campus LDAP (centralized authentication) system is unavailable. The web and application servers are running and are serving pages, the database is up (mounted and started), and the shared content volumes are accessible. Naïve spot checks of elements of the Blackboard system would succeed, but an end-to-end functional test (user login) would fail. From the user and management perspective, end-to-end tests are the most valuable, answering the question “Is Blackboard usable and working the way I expect?” They are necessary but not sufficient – component-level tests are necessary for an operator to isolate the nature of a problem and to direct alerts to the appropriate party. Consider the previous example of the LDAP failure.
If the campus IT group is divided into network, storage, directory (LDAP), and application groups, it may be most effective for the monitoring system to page the directory group when LDAP is down and just send a status email to the application group, possibly without notifying the network or storage groups at all. It is extremely important to tune alerting systems to send the minimum notifications needed to make operators aware of problems. Studies of US pilots shot down over Vietnam and of the Three Mile Island accident showed that in both cases the pilots and plant operators were faced with a large number of alerts which were either ignored or disabled because they were distracting, leading to serious consequences. As a result, the nuclear industry maintains a program to minimize ongoing “benign” alerts to reduce the distraction faced by operators. Operators should be notified by multiple means (e.g. email, SMS page) and should have the ability to acknowledge an alarm to prevent further notifications when a problem is being actively resolved. By recording the results of our monitoring and alerting systems over time, we are able to evaluate trends in performance and response. These trends are vital for capacity planning, system tuning, and outage root cause analysis. We can and should monitor external services to track external dependencies – for example, the LDAP directory in the example above – or our vendors' performance (e.g. TurnItIn, ClearTxt) to verify they are meeting their service level agreements (SLAs), as well as to alert our users when failures occur that are beyond our control to recover. If we can programmatically detect failure, in most cases we can also programmatically recover from it.
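The component-versus-end-to-end triage and alert routing described above can be sketched as follows. Every check function and group name here is a hypothetical stub; real probes would bind to LDAP, query the database, stat the content volume, and script an HTTP login:

```python
# Sketch of component-level vs. end-to-end checks with alert routing.
# Stubs simulate the LDAP-outage example: all components pass except
# that we pretend a real login probe would be wired in per service.

def check_ldap():     return True   # stub: did an LDAP bind succeed?
def check_database(): return True   # stub: is the database mounted and open?
def check_storage():  return True   # stub: is the shared content volume readable?
def check_login():    return False  # stub: did the scripted end-to-end login work?

# Map each component check to the group that owns it.
COMPONENT_CHECKS = {
    "directory": check_ldap,
    "database":  check_database,
    "storage":   check_storage,
}

def triage():
    """Return (group_to_page, groups_to_inform); (None, []) if healthy."""
    for group, check in COMPONENT_CHECKS.items():
        if not check():
            # A component failed: page its owners; the application
            # group only needs a status email.
            return group, ["application"]
    if not check_login():
        # Components pass but the end-to-end test fails: the problem
        # is in the application itself, so page the application group.
        return "application", []
    return None, []

print(triage())
```

With the stubbed values, all component checks pass and the login check fails, so the application group is paged and nobody else is bothered, which is exactly the minimal-notification policy argued for above.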

IV. Automated Recovery

Consider a simple Blackboard analysis and recovery scenario – a single server running Blackboard and its database, with authentication information stored locally. We mechanize the login/logout process (i.e. write a script to simulate user interaction) and if this check fails, we restart the Blackboard web/application server process. We define failure as:

• Any page not served (response code 4xx or 5xx)

• Individual page timeout (not responding to a request within 20 seconds)

• Cannot authenticate

• Total login/logout check not completed within 30 seconds

Our recovery script should check for each of these failure modes and restart Blackboard if any are detected. As a security precaution, we will run our recovery script locally on the web/application server rather than looking for a secure way to restart Blackboard from a remote machine. The official way to restart Blackboard is to issue:

/usr/local/blackboard/tools/admin/ServiceController.sh services.restart

This method is unsatisfactory since often a Java process won't terminate gracefully, so Blackboard processes need to be manually terminated before restarting the system:

/usr/local/blackboard/tools/admin/ServiceController.sh services.stop
/usr/bin/pkill -KILL -u bbuser
/usr/local/blackboard/tools/admin/ServiceController.sh services.start

The restart script is fairly simple, containing routines for mechanizing HTTP transactions, detecting timeouts, and stopping and starting Blackboard. It can be run as a scheduled discrete process (e.g. via cron or Task Scheduler), in which case we need to make sure the script has ended before the next scheduled invocation of the script. Alternately, we could write it as a daemon or persistent service – either way, additional complexity is necessary to ensure the restart script is reliable and does not cause new problems. Later we upgrade, and our system now consists of a hardware load balancer, one or more Blackboard web/application servers, an Oracle database server, and an NFS fileserver, with external dependencies on LDAP. We have to consider a new set of failure modes.

• LDAP failures tend to be transient, and when the LDAP service is recovered, the Blackboard system recovers and new users can log in. Organizationally, we do not have control of the LDAP server and it's treated as an external dependency.

• Since database connections are pooled and the connection pool is created when the application server starts, database problems may require a web/appserver restart after the database has recovered. Restarting the web/appserver before the database has recovered will have no effect.

• NFS issues may result in long timeouts during diagnosis and may require a web/appserver restart after the NFS service has recovered.

• Due to routing issues it may be difficult or impossible to check load balancer health from the servers behind it.

Our hypothetical restart script needs to change to address these new failure modes and the new conditions under which Blackboard is restarted. Since NFS, database, and LDAP problems cannot be resolved by restarting the Blackboard service, we set the script to restart Blackboard only if the dependency tests all pass and the web login test fails. Further, if any of the dependency tests fail, we will stop checking, since it is assumed that Blackboard has failed. The restart script can be used as a monitoring system, so we add logging capabilities and record the time each test takes to complete as well as each time the script restarts Blackboard. Automated recovery is complicated by Blackboard's nonstandard and unnecessarily convoluted operator interface. /usr/local/blackboard/tools/admin/ServiceController.sh calls /usr/local/blackboard/system/build/bin/launch-tool.sh, which calls Java, which (apparently) calls Ant, which executes shell commands such as echo, chmod, blackboard-startup.sh, blackboard-tomcat.sh, collabserverctl.sh, and apachectl, defined in /usr/local/blackboard/system/tooldefs/install/ServiceController/tool-impl.xml. To reiterate: a shell script passes through four layers of indirection only to call a shell command or script. Further, Ant is a build tool; the software should have been built before it was shipped to us from the vendor, and while adding daemon start/stop commands to Ant's build script may be a nice convenience for the Blackboard developers, it's thoroughly unsuitable for production operations. To illustrate this point, let's look at the Linux Standard Base documentation on init scripts, specifically the interface to init (startup/shutdown) scripts: “The start, stop, restart, force-reload, and status actions shall be supported by all init scripts; the reload and the try-restart actions are optional.
Other init-script actions may be defined by the init script.”

In comparison, the arguments to ServiceController.sh are:

• help

• services.start

• services.stop

• services.restart

• services.appserver.start

• services.appserver.stop

• services.appserver.restart

• services.webserver.start

• services.webserver.stop, and

• services.webserver.restart

So regardless of its excessive indirection (arguably a design flaw), ServiceController.sh does not accept the minimal set of standard, expected arguments and cannot be used directly as a startup script. The script has two more problems – it does not reliably stop services or kill processes, and it starts and stops multiple daemons (webserver, mod_perl application server, Tomcat (Java application server), and collaboration server). Ideally these four daemons could be independently controlled, and in fact they can be, if tool-impl.xml is any guide. Often a problem with the overall Blackboard service can be isolated to either the Tomcat or mod_perl daemon, and the ability to reliably diagnose and restart just the problematic subsystem would reduce the time needed to recover the service and provide a better user experience.
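Returning to the dependency-gated restart policy described earlier – restart Blackboard only when every dependency test passes and the web login test still fails – a minimal sketch follows. The probe functions are stubs standing in for real LDAP, database, and NFS checks; the ServiceController invocation and pkill step are the ones given above:

```python
# Hedged sketch of the dependency-gated restart decision. Stub probes
# simulate a state where all dependencies are healthy but login fails.
import subprocess

def ldap_ok():  return True    # stub: real probe would bind to LDAP
def db_ok():    return True    # stub: real probe would ping the listener
def nfs_ok():   return True    # stub: real probe would stat the NFS mount
def login_ok(): return False   # stub: real probe is the scripted web login

BBCTL = "/usr/local/blackboard/tools/admin/ServiceController.sh"

def decide(deps=(ldap_ok, db_ok, nfs_ok), login=login_ok):
    """Return the action the recovery script should take."""
    for check in deps:
        if not check():
            # A dependency is down: restarting Blackboard cannot help,
            # so log it and stop testing until the dependency recovers.
            return "wait"
    if not login():
        return "restart"
    return "ok"

def restart_blackboard():
    # Stop, force-kill stragglers, then start, per the procedure above.
    subprocess.call([BBCTL, "services.stop"])
    subprocess.call(["/usr/bin/pkill", "-KILL", "-u", "bbuser"])
    subprocess.call([BBCTL, "services.start"])

if decide() == "restart":
    pass  # restart_blackboard() would run here on a production server
```

Keeping the decision logic separate from the restart action makes it easy to log each test's outcome and duration, which is what lets the same script double as a monitoring agent.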

One might argue that ServiceController.sh is sufficient and that startup scripts are unnecessary for Blackboard, since human intervention is necessary to ensure persistent storage (NFS, database) is available when the system starts. I contend that this is only valid if the webserver may start before the database server (easily fixed on a single system by specifying daemon start order). As the local Blackboard system architecture becomes more robust (i.e. a tier of load-balanced web/application servers and a clustered or automatic-failover database), that argument becomes less valid. Since the vendor recommends running the web/application servers on dedicated hardware, a server's sole function is to run Blackboard, so it's extremely frustrating to not have standard, reliable init scripts – there is no excuse for this omission considering the maturity, complexity, and cost of the system, and especially considering the small effort required to create them. The lack of init.d-compatible startup scripts is essentially a violation of the social contract between the application, the operator, and the target operating system.

#!/bin/sh
#
# chkconfig: - 60 20
# description: Blackboard application server
# processname: blackboard

# Get config.
. /etc/sysconfig/network

# Get functions
. /etc/init.d/functions

# Check that networking is up.
if [ ${NETWORKING} = "no" ] ; then
    exit 0
fi

# Boilerplate
EGREP=/bin/egrep
PGREP=/usr/bin/pgrep
PKILL=/usr/bin/pkill
SLEEP=/bin/sleep
WC=/usr/bin/wc
TAG=init.blackboard
LOGGER=/usr/bin/logger

RETVAL=0

# Blackboard tuning suggestion: ulimit -Hn 4096
ulimit -n 4096

start() {
    echo -n $"Starting blackboard services: "
    /usr/local/blackboard/tools/admin/ServiceController.sh services.start
    RETVAL=0    # hack!
    # RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/blackboard
    return $RETVAL
}

stop() {
    echo -n $"Stopping blackboard services: "

    # Stop services gracefully.
    /usr/local/blackboard/tools/admin/ServiceController.sh services.stop

    # Let things settle down...
    $SLEEP 3;

    # Kill remaining apache webservers (80)
    OPT="-u bbuser -f '/usr/local/blackboard/apps/httpd/bin/httpd -d /usr/local/blackboard/apps/httpd'"
    NPROCS=`$PGREP $OPT | $WC -l`
    if [ "$NPROCS" -gt 0 ]; then
        echo "Webserver not cleanly killed; killing...";
        $PKILL -9 $OPT
    fi

    # Kill remaining mod_perl webservers (8007)
    OPT="-u bbuser -f /usr/local/blackboard/apps/modperl/bin/httpd"
    NPROCS=`$PGREP $OPT | $WC -l`
    if [ "$NPROCS" -gt 0 ]; then
        echo "mod_perl webserver not cleanly killed; killing...";
        $PKILL -9 $OPT
    fi

    # Kill remaining tomcat appservers (8008)
    OPT="-u bbuser -f 'tomcat'"
    NPROCS=`$PGREP $OPT | $WC -l`
    if [ "$NPROCS" -gt 0 ]; then
        echo "Appserver not cleanly killed; killing...";
        $PKILL -9 $OPT
    fi
    NPROCS=`$PGREP $OPT | $WC -l`
    echo "Brutally killing $NPROCS remaining Appserver processes...";
    $PKILL -9 $OPT

    # Kill remaining collaboration servers (8010, 8011)
    OPT="-u bbuser -f 'blackboard.collab.server.Server'"
    NPROCS=`$PGREP $OPT | $WC -l`
    if [ "$NPROCS" -gt 0 ]; then
        echo "Collaboration server not cleanly killed; killing...";
        $PKILL -9 $OPT
    fi

    RETVAL=0
    # RETVAL=$?
    echo
    [ $RETVAL -eq 0 ] && rm -f /var/lock/subsys/blackboard
    return $RETVAL
}

restart() {
    stop
    start
}

$LOGGER -i -t $TAG -- "Invoked with $1"

# See how we were called.
case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  force-reload)
    restart
    ;;
  status)
    # Probably broken
    status blackboard
    ;;
  restart)
    restart
    ;;
  *)
    echo $"Usage: $0 {start|stop|force-reload|status|restart}"
    exit 1
    ;;
esac

$LOGGER -i -t $TAG -- "Done."
exit $?

Table 6: Sample Blackboard init script for Linux systems

But back to the subject of automated recovery – we have more options available to us than writing our own recovery software. Tools such as or the suite are specifically designed to help build automated recovery tools without coding in a general-purpose language such as Perl, Python, or Ruby. Puppet has a directive to ensure certain daemons are running. There are also wholesale replacements for the init subsystem such as which serve the same purpose. A final note: Oracle offers fewer opportunities for automated recovery, though performance tuning and automatic start on boot simplify diagnosis and prevent common problems. Automated recovery tools protect the service from a certain class of minor outages, but are not sufficient when faced with data corruption, loss, or hardware failure.

V. Restore and Backup

A thorough treatment of data recovery, backup, and business continuity planning is beyond the scope of this presentation; however, it is important to look at common restoration scenarios and backup issues in light of Blackboard's architecture, specifically its file layout. The two main file restoration scenarios are individual file recovery (e.g. from accidental deletion) and large-scale restoration in the case of disaster recovery or business continuity planning. The simplest solution is to back up everything, but that is rarely efficient in terms of time, network bandwidth, or storage space. The challenge is to intelligently select what to back up. We can classify files into four groups: Code, Configuration, Content, and Crap. Code consists of persistent elements required for a service to complete its intended function. Configuration is persistent data that exists to convey information to Code, usually to modify its behavior. Content is persistent data that conveys information to a user. Crap is persistent or temporary data that is not classified as Code, Configuration, or Content. We may also classify files as being sharable and as being volatile. That is, in a redundant load-balanced architecture, can files be shared among identical servers or must each server have its own specific copy of a file? Volatility is a measure of how long a file remains in a specific state (i.e. how long it lasts, how often it is modified). In terms of volatility, Content should be added, changed, or deleted more often than Configuration, and likewise we expect Code should be the least volatile of all; we don't care about Crap because we don't plan on backing it up and ideally the system functions without it. We expect that Code, Content, and some Configuration can be shared, and some Configuration will be specific to a server. Crap can be shared, such as user session data, or local, such as log files, cache, and temporary files.
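The four-group classification can be sketched as a simple path-based classifier. The extension and directory rules below are illustrative assumptions for the sketch, not a complete inventory of Blackboard's file layout:

```python
# Illustrative classifier for the Code/Configuration/Content/Crap
# taxonomy. Rules are assumptions: logs/temp trees and .log/.tmp files
# are Crap; common config extensions are Configuration; executable
# artifacts are Code; everything else defaults to Content.
import os

def classify(path):
    ext = os.path.splitext(path.lower())[1]
    if "/logs/" in path or "/temp/" in path or ext in (".log", ".tmp"):
        return "Crap"
    if ext in (".conf", ".properties", ".xml", ".bb"):
        return "Configuration"
    if ext in (".jar", ".class", ".sh", ".pl"):
        return "Code"
    return "Content"

for p in ("/usr/local/blackboard/config/bb-config.properties",
          "/usr/local/blackboard/logs/bb-services-log.txt",
          "/usr/local/blackboard/apps/tomcat/temp/upload.tmp",
          "/courses/FA2006/syllabus.pdf"):
    print(p, "->", classify(p))
```

A classifier like this, run over the whole installation tree, is what produces the counts of misplaced Configuration and log files reported below; in practice the rules would need tuning per Blackboard version.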
Illustration 1: Relationship between file groups, volatility, and ability to be shared

Pragmatically, we'd like to keep low-volatility shareable files (Code, Configuration) on a read-only shared volume, and shareable Content on a read-write shared volume. Machine-specific Configuration would be on a local or otherwise unshared volume, separated from unshared Crap. Shared Crap would be stored separately from shared Content, Code, and Configuration, ideally not directly on a filesystem. With that in mind, let's look at the Blackboard 6.3 filesystem in a redundant load-balanced configuration. Illustrations 2, 3, and 4 show file age in terms of color – blue files are older than one month, red files are very recent, and ages between current and one month fall on the color spectrum between red and blue. Illustration 2 shows that /usr/local/blackboard/logs is fairly volatile, which is what we expect since, in the context of the application, logs are considered Crap. Interestingly, /usr/local/blackboard/apps/tomcat contains temp and log directories containing volatile files (Crap). Similarly, /usr/local/blackboard/apps/tomcat/work/Catalina/localhost contains many webapps_* directories filled with temporary Java Struts files and other Crap.

Looking only at file names, searching for files with extensions bb, conf, properties, and xml, we found 40 bb files, 8 conf files, 105 properties files, and 165 xml files outside the /usr/local/blackboard/config directory tree. Outside the /usr/local/blackboard/logs directory, we found 530 log files (with log and txt extensions) and 792 tmp files.

Illustration 2: File size, location, and volatility under /usr/local/blackboard (Bb 6.3)

Illustration 3: File size, location, and volatility under /usr/local/blackboard/apps/tomcat

Illustration 4: File size, location, and volatility under /usr/local/blackboard/apps/tomcat/work/Catalina/localhost

In short, the /usr/local/blackboard/apps directory tree contains substantial numbers of non-Code files, especially volatile Crap. Substantial numbers of Configuration files exist outside the /usr/local/blackboard/config directory tree. Substantial numbers of log files exist outside the /usr/local/blackboard/logs directory tree. There is no top-level directory for storing temporary data. Each of these issues makes it difficult to simply segregate files into local/shared and read-only/read-write volumes. Many problems seem to be localized to Tomcat, specifically because it uses the default or naïve configuration for file locations. The Apache httpd servers have similar issues with regard to configuration files, and in both cases Tomcat and httpd can be trivially configured to direct logs to a specific directory. It is not clear if temporary files can be as easily redirected, especially for webapps and plugins.

The database also contains Code, Configuration, Content, and Crap. Stored procedures are a form of Code; the data tables contain Content and Configuration as well as a great deal of Crap. The Activity Accumulator (AA) contains audit details of user activity and is essentially a high-volume source of log information. From the perspective of the application user, this data is Crap. Normally, we'd dispose of it at the earliest opportunity, except that from the administrative perspective it's an audit trail which must be retained for a defined period to resolve academic disputes. In this case, the AA data is Content. There would be no problem if the production rate of AA data were small, but it isn't – it grows linearly with the number of users. Depending on the database configuration and the AA production rate, this log information can fill an Oracle tablespace, causing the database to stop accepting new transactions and thus halting the application. This is equivalent to crashing a server because logfiles filled the root partition.
One answer is to not save the AA data in the same database as Content. I question the wisdom of keeping it in a database at all. Since it is historical audit data, it should be treated like any other log information: serialized to the filesystem, compressed, and archived until its retention time expires. That application functionality is so tightly coupled to the logging system is a design flaw; rather than halting the application, log records should be written to a flat file and replayed into the AA database once it recovers.

Raw filesystem and database backup and recovery are insufficient for recovering accidental deletion of grades, discussion posts, etc. because this would require point-in-time recovery capability of both the database and filesystem as well as detailed knowledge of the database schema. Far more effective is the frequent archiving of active courses. It is not difficult to generate a list of active courses by directly querying the database, or by encoding a time period in your course naming convention (e.g. FA2006 represents the Fall 2006 semester). Periodically archiving courses stores a consistent set of course data from the database and filesystem, compresses it, and saves it to the filesystem. This can be restored to a new or existing course, and can be used to recover accidentally deleted files, gradebook entries, etc. As a side effect, the size of the archive is a rough metric of the size of the course, so rather than scanning the content directory, courses using an inordinate amount of space can be easily detected. Courses with large archives can be investigated, and faculty using too much disk (usually from uploading raw media files) can be counseled on using Blackboard more effectively without setting quotas. Also, small archives indicate unused courses, so archive sizes can tell a lot about system utilization as a side benefit of protecting user content.

Illustration 5: Spectrum of Course Archive File Sizes ("File Size Spectrum for Fall 2007 Course Archives"; y-axis: log(file size [kB]))
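The archive-size-as-metric idea above can be automated with a few lines. In this sketch the thresholds and the .zip naming convention are assumptions for illustration, not Blackboard defaults:

```python
import os

def archive_report(archive_dir, big_kb=500_000, small_kb=10):
    """Flag unusually large or small course archives.

    Large archives suggest raw media uploads worth investigating;
    very small ones suggest unused courses. Thresholds are examples.
    """
    big, small = [], []
    for name in sorted(os.listdir(archive_dir)):
        if not name.endswith(".zip"):
            continue
        kb = os.path.getsize(os.path.join(archive_dir, name)) / 1024
        if kb > big_kb:
            big.append((name, kb))    # candidate for a disk-usage chat
        elif kb < small_kb:
            small.append((name, kb))  # probably an unused course
    return big, small
```

Feeding the full size list into a histogram (as in Illustration 5) gives the overall spectrum; the two flagged lists give the actionable outliers.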

Finally, there's no backup script supplied for Oracle. One can be cobbled together from various sources, and it's likely to be a fairly involved process, but one might long for a simpler database such as PostgreSQL, EnterpriseDB, or MySQL. While most threats to the integrity and reliability of Blackboard come from within, we must also consider threats from without.

VI. Security A thorough treatment of network and web application security is beyond the scope of this presentation. In addition to basic network, host, and web server security best practices, consider the following:

• Coordinate with your institution's network security group and periodically test the firewall protecting your web/application servers. If you do not have a firewall, install one and test it. If your network security policy prohibits border firewalls[viii], install a local firewall (iptables, etc.) on each host and test that. Do not allow traffic from the public internet to your database server or fileserver, if applicable.

• Look for common entries in the web server error logs and group them into classes.

• Install mod_security on the Apache httpd server listening on port 80. Test some of the default rulesets, then build and test your own based on what was found in the error logs. As you deploy more rules, your error logs should become smaller as more unwanted traffic is blocked, which makes scanning the error logs easier as time goes on. Note: this may void your warranty, but you will sleep better at night. The risk to your production systems from not having Blackboard support is probably much smaller than the risk of accepting known bad traffic[ix], and you are likely to respond to security threats more quickly than Blackboard does, given that security patches are bundled with bug fixes and new features, requiring a monolithic upgrade. Most institutions can only upgrade during very specific outage windows planned far in advance, so even if Blackboard released upgrades as vulnerabilities were discovered, the monolithic patch/upgrade process would prevent those fixes from being applied in a timely fashion.

• Install a host intrusion detection system (HIDS) such as AIDE or Tripwire. Use this to detect unexpected changes in your filesystem. This is useful for detecting volatility buried in the filesystem which visualization tools such as Philesight or KDirStat may not show.

• Independently test a non-production Blackboard installation for vulnerabilities with a web security scanner such as Nikto.

Security systems can also detect serious operational problems not related to security. An example: Blackboard 7.0 had a bug[x] in which temporary session files were deleted but session directories were not. This led to unbounded accumulation of session directories, and given the finite number of inodes (akin to entries in the file allocation table), eventually the accumulation of empty directories would consume all the inodes in a filesystem. The failure symptom would be “disk full” (unable to allocate space on disk because all inodes were consumed). From an operator perspective, the disk would report plenty of free blocks (empty space) but an inability to allocate them (no free inodes), so it may take a while to identify the root cause. Further, it may take hours or days to remove the thousands of empty directories consuming all the inodes. Unless you have a filesystem monitoring tool that can show volatile empty files or directories, it is very difficult to predict and solve this problem before it leads to system failure.

A host intrusion detection system maintains a database of file metadata – size, cryptographic signature, creation and modification time, etc. – and compares the current state of a filesystem with a previous state. Differences in state are reported to the operator. If the report omits files that are expected to change (e.g. logs, temporary files and directories), it can clearly show potential security problems, such as changed command binaries or libraries. If one uses a HIDS to detect changes in the /usr/local/blackboard/apps directory tree, both the accumulation of empty directories and their specific locations become obvious.
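Both symptoms of the session-directory leak can be watched for directly. A minimal sketch (POSIX-only, since os.statvfs is unavailable on Windows):

```python
import os

def empty_dirs(root):
    """Count completely empty directories under root; the session-directory
    leak described above appears as this number growing without bound."""
    return sum(1 for _path, dirs, files in os.walk(root)
               if not dirs and not files)

def inode_headroom(path):
    """Fraction of free inodes on the filesystem holding path."""
    st = os.statvfs(path)
    return st.f_ffree / st.f_files if st.f_files else 1.0
```

Polling these from a cron job and alerting when empty_dirs grows or inode_headroom drops below a threshold catches the problem long before “disk full.”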
Using a monitoring system to show trends in inode usage and alert on low free inodes should also help detect this problem.

In another instance, a commercial firm has produced a Facebook application that queries the Blackboard system of a Facebook user's school to display course information, etc., on the user's Facebook page. Given FERPA and security concerns (e.g. third-party access to authentication credentials and student information) and system reliability and capacity planning issues, some institutions consider traffic from this plugin to be abusive or unwanted even though individual requests are not intended to be malicious[xi]. The Blackboard system was not designed as a web service provider and doesn't provide lightweight data access via XML-RPC or SOAP; further, Blackboard system capacity planning does not consider traffic of that nature because the vendor code does not provide for it. In short, the system is sized to serve typical human users, not automated third-party tools. By identifying characteristics of abusive or unwanted traffic, we can use firewalls and mod_security rulesets to drop this traffic, preserving system performance for our primary users.

Finally, there are still issues regarding unsafe and broken HTML content being stored in the database. A user added HTML content containing the <base> tag, which explicitly specifies the base URL catenated to relative URLs to form full URLs. This breaks links both when viewing and when editing the content, so in-band (browser-based) correction is impossible – it can only be fixed by manually updating database records. Similar issues exist for HTML entities, “smart quotes”, etc. Even if a future Blackboard upgrade fixed this problem with user input filtering, what effort would be made to fix broken content already stored in the database? What other pitfalls await us?
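Returning to the mod_security approach to unwanted traffic: once a distinguishing characteristic is identified (here, a User-Agent header), a rule of roughly the following shape could deny matching requests. This is a hedged sketch against ModSecurity 2.x syntax; the agent string is a placeholder, not the actual application's, and a production rule would be tested on a non-production host first.

```apache
# Hypothetical ruleset: deny requests from an unwanted automated client.
# "ExampleThirdPartyClient" is a placeholder User-Agent string.
SecRuleEngine On
SecRule REQUEST_HEADERS:User-Agent "@contains ExampleThirdPartyClient" \
    "phase:1,deny,status:403,log,msg:'Blocked third-party LMS scraper'"
```

Blocking in phase 1 (request headers) keeps the rejected request from ever reaching Tomcat, so the cost of unwanted traffic is paid at the cheapest tier.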

VII. Operational Hygiene Often operational problems can be traced to aspects of system design[xii]. Without visibility into the Java code[xiii], we can only look at the system behavior and data. Here are a few issues we've identified.

The filesystem is a database, so storing lists of files within the application database is effectively redundant, and this non-normalized data is by definition unable to be kept in sync. Is this a problem? A simple query coupled with filesystem tests of existence and size can give the answer. A recent test of our production LMS showed 119 missing or wrong-sized (zero-length) files out of over 100,000 files stored in the filesystem – roughly a 99.9% accuracy rate. “Three nines” sounds pretty good until you consider that we retain courses for two years, meaning that on average we lose one file every six days. To be fair, this is a “stale cache” problem common to all applications that maintain lists of files in a database. The missing files are likely the effect of an unreliable shared filesystem or system crashes rather than a flaw in the Blackboard application. Still, the problem is easily detected, and it's doubtful that the Blackboard application tests for this or alerts the sysadmin or user to the problem in any meaningful way. Even if the problem can be attributed to hardware or non-application software faults or operator error, the user will report it as a “problem with Blackboard.” In the same way that management would rather hear about problems from staff than from users, sysadmins would rather detect problems before they are found by users. This leads to other interesting questions: how do we know that all the files in the content areas of the filesystem are considered content by Blackboard? Are there orphaned files in the filesystem that cannot be accessed by users? How do we find them, and what do we do with them?

What fraction of sites are running Blackboard virtual installations?
My guess is that the fraction is small. If so, why do the snapshot controller, et al., require the virtual installation to be specified? If there's only one, couldn't the application detect that and do what's expected of it? Solution: write wrapper scripts that obtain the virtual installation data from the database (or, less ideally, from a configuration file). Why are database credentials stored in the database? I suppose this might make sense in a large-scale hosting scenario, but in a single-institution environment it's baffling and unnecessarily complicates setting up test/dev servers, testing database restores, and disaster recovery. A long-running daemon process should be able to determine whether a temporary file is still valid or necessary and delete it if it isn't. Even without the use of scheduled jobs, a maintenance thread could periodically reap stale or unused cache, session, and temporary files.
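The maintenance thread suggested above amounts to a mtime-based reaper. A minimal sketch, with the 7-day threshold as an arbitrary example and a dry-run default so it can be audited before it deletes anything:

```python
import os
import time

def reap_stale(root, max_age_days=7, dry_run=True):
    """List (and, when dry_run is False, delete) regular files under
    root not modified within max_age_days."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
                if not dry_run:
                    os.remove(path)
    return stale
```

The same logic run in-process would keep Crap from accumulating between scheduled jobs; a follow-up pass could remove the directories left empty, closing the inode-leak scenario from Section VI as well.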

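The "simple query coupled with filesystem tests of existence and size" described earlier in this section can be sketched in a few lines. The manifest of (path, expected size) pairs would come from a site-specific database query, which is omitted here:

```python
import os

def missing_or_empty(manifest):
    """Given (path, expected_size) pairs pulled from the application
    database, return entries whose backing file is absent or
    unexpectedly zero-length."""
    bad = []
    for path, expected_size in manifest:
        if not os.path.exists(path):
            bad.append((path, "missing"))
        elif expected_size > 0 and os.path.getsize(path) == 0:
            bad.append((path, "zero-length"))
    return bad
```

Run nightly, this turns silent file loss into a report the sysadmin sees before a student does.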
Linux systems use logrotate to compress and rotate system and application logs. Rather than using an opaque and immutable custom Java-based log rotation program, why not provide a simple set of logrotate directives instead? It's less code to write and maintain, and it gives the local administrator more control and flexibility over how and when logs are rotated, how long they are retained, etc. As provided, the log rotation tool is worse than useless because it doesn't handle all logs, leaving Crap to accumulate in the filesystem.
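A sketch of what such directives might look like, assuming Blackboard logs live under /usr/local/blackboard/logs; the schedule, retention count, and glob patterns are illustrative and would be tuned locally:

```
# /etc/logrotate.d/blackboard -- illustrative only
/usr/local/blackboard/logs/*.log /usr/local/blackboard/logs/*/*.log {
    weekly
    rotate 8           # keep eight weeks of history
    compress
    delaycompress      # leave the newest rotation uncompressed
    missingok
    notifempty
    copytruncate       # rotate without restarting the writing process
}
```

copytruncate avoids coordinating with the Java processes holding the log files open, at the cost of possibly losing a few lines during the copy; services that can be signaled to reopen their logs could use a postrotate script instead.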

Foreword to the Future There are many aspects of Blackboard that remain to be analyzed, and many issues that remain unresolved; it's not clear if or when they will be resolved. Certainly some have been resolved in later versions, but operational issues seem to take a far back seat to bug fixes and new features. While there are limited resources and infinite desires, the age, importance, and cost of Blackboard make many of the issues described inexcusable. They directly affect the system's total cost of ownership (TCO) and reliability and need to be fixed. There are too many problems to address piecemeal through the feature request process, and attempts to have them addressed or even acknowledged through other channels have been fruitless. The Blackboard system needs an operational overhaul with an eye to system administration best practices and published standards. If a solution will not come from the vendor, it must come from the community.

About the Author Bob Apthorpe has been developing web-based software since 1995 and administering large-scale web systems since 1999. In 2002 he was hired to maintain Blackboard and other web systems at St. Edward's University in Austin, TX. He has a B.S. and M.S. in nuclear engineering from the University of Wisconsin, and from 1992 to 1995 was a safety analyst at a nuclear power plant, where he served as the chair of the Information Technology Review Board to manage the qualification of safety-related software. He is currently on the board of directors of the League of Professional System Administrators (LOPSA).

Appendix I: Software References All material developed for this presentation will be posted at http://lmstools.cynistar.net/

Revision Control Systems
Concurrent Versions System (CVS) - http://www.nongnu.org/cvs/
Subversion (SVN) - http://subversion.tigris.org/
svnX - http://www.lachoseinteractive.net/en/community/subversion/svnx/features/
TortoiseSVN - http://tortoisesvn.tigris.org/

Configuration Management Systems
bcfg2 - http://trac.mcs.anl.gov/projects/bcfg2
Cfengine - http://www.cfengine.org/
g4l - http://sourceforge.net/projects/g4l
g4u - http://www.feyrer.de/g4u/
Jumpstart - http://www.sun.com/bigadmin/features/articles/jumpstart_x86_x64.jsp
Kickstart - http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/Installation_Guide/ch-kickstart2.html
lcfg - http://www.lcfg.org/
Puppet - http://reductivelabs.com/projects/puppet/
Ultimate Deployment Appliance - http://www.ultimatedeployment.org/uda/index.html
VMWare - http://www.vmware.com/
Xen - http://www.citrixxenserver.com/Pages/default.aspx

Monitoring, Trending, and Alerting
Big Brother - http://bb4.com/
Cacti - http://cacti.net/
Munin - http://munin.projects.linpro.no/
Nagios - http://www.nagios.org/
RRDTool - http://oss.oetiker.ch/rrdtool/
Servers Alive - http://www.woodstone.nu/salive/
Splunk - http://www.splunk.com/
Zenoss - http://www.zenoss.com/

Automated Recovery
daemontools - http://cr.yp.to/daemontools.html
launchd - http://developer.apple.com/macosx/launchd.html
Monit - http://www.tildeslash.com/monit/

Restore and Backup
Backup Central - http://www.backupcentral.com/

Security
mod_security - http://www.modsecurity.org/
Nikto - http://www.cirt.net/code/nikto.shtml
Wireshark - http://www.wireshark.org/

Operational Hygiene
Jad - http://www.kpdus.com/jad.html
LambdaProbe - http://www.lambdaprobe.org/d/index.htm
Philesight - http://zevv.nl/code/philesight/
SequoiaView - http://w3.win.tue.nl/nl/onderzoek/onderzoek_informatica/visualization/sequoiaview/

Endnotes
[i] Except for the single host running the collaboration server.
[ii] Not counting standard dependencies such as DNS, NTP, networking, power, HVAC, etc.
[iii] Interestingly enough, loss of LDAP does not affect logged-on users.
[iv] Apache ant (http://ant.apache.org/) is a Java build tool analogous to make.
[v] See http://refspecs.linux-foundation.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html or, for Solaris, see http://docs.sun.com/app/docs/doc/819-2379/6n4m1vlpt?l=en&a=view#fahqr
[vi] A compelling argument could be made that we never left the subject; provided all dependencies were met, an init script would automatically recover Blackboard in the case of server reboot.
[vii] The “four-C” groupings are dependent on context; for example, on log hosts system and application logs are Content, while on production servers logs are Crap. Are plugins Code or Content? Discuss.
[viii] Sadly, this happens.
[ix] If our experience is any guide, those sending malicious traffic to our systems will be around a lot longer than our Technical Support Manager.
[x] Since fixed.
[xi] I also contend that firms which reap a financial benefit while externalizing the cost of content production and distribution to our institution should repay us some fraction of that cost. Further, I believe students should be made aware of this economic imbalance with respect to for-profit plagiarism detection firms, but that is not germane to this discussion.
[xii] This is perhaps an understatement.
[xiii] Without a Java decompiler such as Jad or a Tomcat monitor such as LambdaProbe. Using either will probably violate your license agreement, so let your conscience (or legal staff) be your guide.