The Seven Pillars of Operational Wisdom: Selected Topics in Blackboard System Administration
Abstract

The Blackboard Learning Management System presents many challenges for the system administrator due to its complexity. Some of these challenges are common to production web applications, while many others result from the design and implementation of Blackboard. Seven key areas of system administration are explored in terms of both theory and practice, and illustrative problems, root causes, methods of diagnosis, and solutions (if any) are discussed. The areas are version control, configuration management, monitoring, automated recovery, restore and backup, security, and “operational hygiene” (other Blackboard-specific operational issues). The tools and techniques described are for a load-balanced Enterprise system running on Linux, but much of the underlying theory and practice applies to all platforms.

Preamble, Caveats, and Intent

This work is based on a few assumptions about the audience's goals and understanding of system administration in general and the design of robust, scalable web systems in particular. First, institutionally we are constrained by budget and knowledge: we cannot be expected to have enterprise Oracle or Java web application design and analysis experience, and we cannot afford to throw money (hardware, consultants, diagnostic tools, certifications) at problems to solve them. Our customers and management have high expectations for reliability, and as administrators we will do everything reasonable to meet those expectations, including remote and on-site administration at all hours. Our goal is to maintain a cost-effective and reliable learning management service that minimizes the consumption of our limited time, budget, and patience.

I. Revision Control

The notions of observability and repeatability are just as important in system administration as they are in the physical sciences.
Similarly, consistency of manufacture is as important when deploying a new server as when producing a physical product. In software development, a revision control system or repository is used to maintain an audit trail or changelog of source code as well as a means of retrieving past versions of the code. “Code” typically refers to textual data but can also refer to binary objects such as images and audio files. In terms of Blackboard administration, a revision control system provides us with the means to store, annotate, and retrieve configuration files.

New Blackboard systems are usually deployed after an iterative process of testing and reconfiguration, with the final tested configuration placed into production. In a horizontally-scaled (load-balanced) architecture, it's vital to deploy an identical known good configuration to all hosts in the web/application tier[i]. Further, some problems may not be found immediately, so it's important to be able to revert to a known good configuration. We can achieve both goals, deploying identical configurations and reverting to a known good state, by keeping configuration files in a revision control system.

Another benefit of using a revision control system is that it gently formalizes the change process. Rather than logging into a production server and directly editing a configuration file, a working copy is checked out from the repository to the local host, edited, and checked in or committed back to the repository with an annotation explaining the reason for the change. The new version is then checked out on the production servers. If there are any problems, the changes can be reverted by checking out the previous approved version of the file.
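Checking that every host in the web/application tier carries the known good configuration can be sketched in a few lines of Python. This is an illustration only, not part of the toolchain described here: the function names `fingerprint` and `drift_report`, the host names, and the sample file contents are all hypothetical, and a real check would read the repository working copy and the deployed files from disk.

```python
import difflib
import hashlib


def fingerprint(text):
    """Return a SHA-256 hex digest of a configuration file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drift_report(known_good, deployed, name="config"):
    """Return unified-diff lines showing how `deployed` departs from
    `known_good`; an empty list means the two are identical."""
    return list(difflib.unified_diff(
        known_good.splitlines(),
        deployed.splitlines(),
        fromfile=name + " (repository)",
        tofile=name + " (deployed)",
        lineterm="",
    ))


# Hypothetical contents: the repository copy and two web/application hosts.
repository_copy = "MaxClients 150\nKeepAlive On\n"
hosts = {
    "web1": "MaxClients 150\nKeepAlive On\n",
    "web2": "MaxClients 256\nKeepAlive On\n",  # edited by hand on the host
}

for host, deployed in sorted(hosts.items()):
    if fingerprint(deployed) == fingerprint(repository_copy):
        print(host + ": matches repository")
    else:
        print(host + ": DIFFERS from repository")
        for line in drift_report(repository_copy, deployed, "httpd.conf"):
            print("  " + line)
```

Comparing digests gives a cheap yes/no answer per host, and the diff is produced only for hosts that have drifted, which keeps the report short on a healthy tier.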
The file on the production servers can be compared to what's stored in the repository to show differences, the changelog will show the history of changes to the file, and a semblance of observable and repeatable change management will have been achieved.

The administrative overhead of deploying a revision control system is small, considering the two most common revision control systems are well-documented, free, and supported in most major text editors and integrated development environments (IDEs) on all major platforms. Setting up a repository on a server is fairly straightforward, simplified by the availability of high-quality documentation and binary executables for all major server platforms. We have found the largest cost has been in time spent training staff to adapt to the new workflow.

As useful as a revision control system is, it is only the first step in managing server configuration. There are many other aspects of server configuration that are too complex to manage with only a revision control system.

II. Configuration Management

Configuration management (CM) is the practice of positively controlling aspects of a host's local configuration. Configuration resources that can be managed include:

• configuration files
• users and groups
• file permissions and ownership
• mounts and mount points
• software (installation, removal, version)
• network interfaces
• daemons and services (running and disabled)

Server configuration is initially set at build time and is kept under control over the server's lifecycle by a CM system installed at server build time. Initial configuration is controlled by using an automated server build system such as Jumpstart (Solaris) or Kickstart (Red Hat Linux, CentOS) or, alternately, by using a disk imaging system such as Norton Ghost, g4l, or g4u. Disk imaging systems are not recommended since they require manual partitioning of disks and are not as robust as automated build systems.
Unattended system builds can be performed by using a DHCP server on an isolated build network, PXE (network) boot, a software repository (often an installation CD copied into the document root of a webserver), and a build configuration file.

    # Kickstart file automatically generated by anaconda.
    install
    url --url http://repository.stedwards.edu/centos/4.4/os/i386/
    lang en_US.UTF-8
    langsupport --default=en_US.UTF-8 en_US.UTF-8
    keyboard us
    xconfig --card "ATI Mach64 3D Rage IIC" --videoram 4096 --hsync 30-81 --vsync 56-76 --resolution 800x600 --depth 16
    network --device eth0 --bootproto dhcp
    rootpw --iscrypted $1$.qwertyuiopasdfghjklzxcvbnm1234
    firewall --disabled
    selinux --disabled
    authconfig --enableshadow --enablemd5
    timezone --utc America/Chicago
    bootloader --location=mbr
    # The following is the partition information you requested
    # Note that any partitions you deleted are not expressed
    # here so unless you clear all partitions first, this is
    # not guaranteed to work
    clearpart --all --drives=sda
    part /boot --fstype ext3 --size=100 --ondisk=sda
    part pv.10 --size=0 --grow --ondisk=sda
    volgroup VolGroup00 --pesize=32768 pv.10
    logvol swap --fstype swap --name=LogVol01 --vgname=VolGroup00 --size=1000 --grow --maxsize=2000
    logvol / --fstype ext3 --name=LogVol00 --vgname=VolGroup00 --size=1024 --grow

    %packages
    @ dialup
    @ mail-server
    -dovecot
    grub
    -spamassassin
    kernel-smp
    e2fsprogs
    lvm2
    # Add nice things to have
    lynx
    zsh
    # Remove unnecessary bloat on server
    -bluez-bluefw
    -bluez-hcidump
    -bluez-libs
    -bluez-utils
    -finger
    -irda-utils
    -isdn4k-utils
    -mysql
    -NetworkManager
    -pcmcia-cs
    -up2date
    -wireless-tools

    %post
    /sbin/chkconfig cups off
    /bin/rpm -i http://maxie.it.stedwards.edu/depot/aide/aide-0.11-1.rf.i386.rpm
    /bin/rpm -i http://maxie.it.stedwards.edu/depot/ruby/ruby-1.8.6-3.i386.rpm
    /bin/rpm -i http://maxie.it.stedwards.edu/depot/puppet/facter-1.3.7-1.noarch.rpm
    /bin/rpm -i http://maxie.it.stedwards.edu/depot/puppet/puppet-0.22.3-1.noarch.rpm
    # update /etc/sysconfig/puppet with proper puppet server

Table 1: Example kickstart.cfg file

Alternately, the same process could be used to create a standard image of a virtual machine to be run under VMware, Xen, or an equivalent system. What matters is that known good initial configurations can be repeatably produced.

Once one has defined a standard, minimal operating system installation including a configuration management system client, it is possible to use the CM system to customize a generic server to its specific role and keep its configuration under positive control. There are many free, production-quality CM systems available for Unix systems (bcfg2, cfengine, lcfg, puppet). In our case, we deployed Puppet on unmanaged servers and gradually normalized the configurations. This allowed us to gracefully introduce configuration management and see immediate benefits as we learned more about Puppet.

By storing Puppet manifests (configurations to be applied to a target system) in a revision control system we were able to deploy and revert configurations without logging into the target systems and were assured that configurations would be kept consistent. Any local changes made to a resource managed by Puppet would be overwritten by the canonical configuration in a few minutes and a log of the change would be made.

    node 'zim.it.stedwards.edu' {
        include aide
        include apcupsd
        include base
        include blackboard_web_tidy
        include dependency_crond
        include dependency_dns
        include dependency_ldap
        include dependency_ntpd
        include dependency_sendmail
        include dependency_sshd
        include diagnostics
        include ntp
        include mail_aliases
        include munin_node
        include puppet_client
        include resolver
        include sendmail_cf
        include service_monit
        include shells
        include software_depot_client
        include sudo
        include syslog
        include users_blackboard
        include users_compserv
        include users_itec
        include users_monit
        include users_munin
        include utility_philesight
    }

Table 2: Puppet node description for Blackboard
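The overwrite-and-log behavior described above, where a managed resource is pulled back to its canonical state whenever it drifts, can be illustrated with a small Python sketch. This is not Puppet's implementation, only a minimal model of the idea for a single file resource; the function name `converge_file`, the file name, and the file contents are hypothetical.

```python
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("converge")


def converge_file(path, canonical):
    """Make the file at `path` match `canonical`.

    Returns True if the file had to be written, False if it was
    already in the desired state (so repeated runs are harmless).
    """
    try:
        with open(path, "r", encoding="utf-8") as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None

    if current == canonical:
        return False

    with open(path, "w", encoding="utf-8") as handle:
        handle.write(canonical)
    log.info("%s did not match canonical content; rewritten", path)
    return True


# Demonstration in a scratch directory; names and contents are made up.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "ntp.conf")
canonical_ntp = "server ntp.example.edu\n"

first_run = converge_file(target, canonical_ntp)    # file missing: created
second_run = converge_file(target, canonical_ntp)   # already correct: no-op

with open(target, "w", encoding="utf-8") as handle:  # simulate a hand edit
    handle.write("server rogue.example.com\n")
third_run = converge_file(target, canonical_ntp)    # drift: overwritten

print(first_run, second_run, third_run)
```

The key design point is that the function describes a desired end state rather than a sequence of steps, so running it on a schedule is safe: it acts, and logs, only when the host has drifted.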