55 A HIGH PERFORMANCE LDM DATA CLUSTER

Robert Lipschutz, David Hagerty*, Paul Hamer, Peter Lannigan*, and Chris MacDermaid
Cooperative Institute for Research in the Atmosphere (CIRA), Colorado State University, Fort Collins, Colorado USA, and NOAA Earth System Research Laboratory (ESRL), Boulder, Colorado USA

*NOAA Earth System Research Laboratory (ESRL), Boulder, Colorado USA, under contract with Riverside Technology, Inc., Fort Collins, Colorado USA

Corresponding author address: Bob Lipschutz, NOAA/ESRL/GSD, R/GSD2, 325 Broadway, Boulder, CO 80305. [email protected]

1. INTRODUCTION

Within the NOAA Earth System Research Laboratory (ESRL), the Global Systems Division (GSD) develops weather information systems, weather forecast models, and other applications in support of the National Weather Service, the Federal Aviation Administration, and other agencies. Well-known GSD products include the Rapid Update Cycle (RUC) model, the Local Analysis and Prediction System (LAPS), the Meteorological Assimilation Data Ingest System (MADIS), and Science On a Sphere® (SOS). A common feature of these and other projects is that they require observational and model data provided by acquisition systems running within GSD's Central Facility. These systems handle some 800 GBytes of incoming data per day as they acquire, decode, store, and distribute the needed data sets for GSD scientists and their collaborators.

To substantially improve its data systems, GSD recently commissioned a six-host cluster. This new system replaces a collection of aging Linux High-Availability (HA) pairs and stand-alone platforms to provide much-needed scalability and throughput performance, as well as excellent reliability, resource utilization, and configurability.

In this paper, we detail the implementation of the cluster architecture, describing how the combined services of Red Hat Cluster Suite, Sun Grid Engine (SGE), fcron, and Unidata's Local Data Manager (LDM) have enabled us to construct a high performance real-time data processing environment that serves the GSD community. Given the central role of LDM in the architecture, we particularly focus on how LDM provides cluster services for data transport, and on the use of event notifications in LDM.

2. LIMITATIONS OF THE HA ARCHITECTURE

While GSD's HA pairs have operated successfully for a number of years (Lipschutz and MacDermaid, 2005), and provided robust availability through their automated fail-over mechanism, a major drawback has been the need for two machines, one of which is mostly idle, to support a single processing configuration. Thus, system expansion or refreshment necessitated acquiring and supporting two new hosts at a time, adding to systems administration effort, power and cooling burden, and cost. In addition, while several HA pairs may initially have shared common system configurations, the individual nature of the systems resulted in drift in the configurations over time that became problematic to maintain.

At the application level, needing to individually configure each host's functionality was tedious. It often forced choices for where to locate processing based on where prerequisite data existed, rather than where system resources were available. This resulted in both unbalanced processing loads and complicated data flows. Further, limited local disk space on the HA systems hampered processing of the ever larger data sets that must be handled. Documenting the processing for a data set and troubleshooting data outages was complicated by the need to identify the specific platforms where processing elements were performed.
3. GOALS FOR THE CLUSTER

In planning for the cluster, the intent was to address the HA systems' limitations with these benefits:

• Scalability – processing capacity can be added one machine at a time.

• Throughput – load balancing mechanisms allow for much greater aggregate throughput than was achieved on hand-configured systems. Of course, throughput also improves with newer, multicore processors and faster networking and storage devices.

• Utilization – cluster hosts are more effectively used than HA pairs in that half the hosts are not virtually idle backups. Rather, all cluster nodes are active, and load balancing methods ensure that jobs are distributed sensibly. In addition, the cluster's shared storage arrangement provides space to all cluster applications, whereas in the HA architecture, spare local disk space on one host is unavailable to another host that might need space.

• Maintainability – cluster hosts share a common system configuration, substantially simplifying ongoing system administration. Host reboots and failovers for patching are also easier than with the HA arrangement. On the application side, multiple configurations are replaced by a single, common set of configuration files. In addition to reducing the number of files to maintain, this also eliminates duplicate configuration items across hosts.

• Supportability – application data and log files reside on the shared storage, eliminating the need to know which specific host ran a job.

• Energy efficiency – higher efficiency in CPU utilization means less heat generation, and so lower cooling requirements compared to the HA computing infrastructure.

In addition to these general goals, we strongly desired to retain as much as possible of our long-established application software, thus minimizing code and configuration changes. We further recognized a continuing need, well served by the HA pairs, to be able to isolate onto separate platforms such high-volume data streams as those from the NWS Satellite Broadcast Network (SBN - NOAAPORT) and the WSR-88D Level-II radars, and thereby use independent processing resources to minimize resource contention.

4. SOFTWARE COMPONENTS

To achieve our goals, we implemented the cluster with these software components:

• the CentOS Enterprise Linux operating system,

• the Red Hat Cluster Suite for cluster-wide application services and failovers,

• Sun Grid Engine (SGE) for job activation and load balancing,

• fcron for cluster-wide time-based job triggering,

• Unidata's Local Data Manager (LDM) for data transport and event-based job triggering, and

• Open-E storage software for controlling the shared (via NFS) Data Storage Server (DSS) RAID disk system.

Within this framework, GSD's Object Data System (ODS) applications (Hamer, 2005) are responsible for such data processing tasks as converting GOES Variable (GVAR) satellite data, Gridded Binary (GRIB) model data, WSR-88D Level-II radar data, and a variety of point observation data types into the netCDF formats needed by GSD user applications. Only minimal changes to the existing ODS software were needed to accommodate the cluster's architecture. In addition, two new scripts were developed to provide common methods for submitting jobs to the SGE job queue from LDM (pqact) and fcron, while several old scripts were extended to include a cluster-wide locking mechanism to avoid concurrent instances of some jobs. Modifications to LDM scripts were also needed, as described in Section 5.4.
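The paper does not describe the cluster-wide locking mechanism itself; as one plausible sketch (paths and job names here are hypothetical, not the operational ones), a lock directory created atomically on the shared storage can serve as a cluster-wide mutex, since mkdir succeeds on exactly one host and fails everywhere else:

    #!/bin/sh
    # Hypothetical cluster-wide lock using an atomic mkdir on shared storage.
    LOCKDIR=/data/locks/purgeGrids.lock        # illustrative path on the shared RAID
    if mkdir "$LOCKDIR" 2>/dev/null; then
        trap 'rmdir "$LOCKDIR"' EXIT           # release the lock when the job finishes
        # ... body of the job that must not run concurrently ...
        :
    else
        echo "purgeGrids already running on another host; exiting" >&2
        exit 0
    fi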
5. LDM ON THE CLUSTER

LDM has long served as the key middleware in GSD for real-time data transport and event-driven data processing services. In preserving the same LDM data processing capabilities used on the HA systems, the cluster architecture compartmentalizes data streams that deserve to be separated, but also shares across hosts the data and processing that can be accommodated on any host. To accomplish this, we differentiate data flow services from event handling based on data arrival "notification" messages.

5.1 LDM and ODS

In a manner similar to pqact, ODS client applications process desired data messages by attaching directly to the LDM data queue. For example, the LdmGrib2Flat client captures GRIB records in the NOAAPORT stream. The client writes the data into forecast-hour GRIB files that it opens in dynamically determined directories using the center, sub-center, model, and grid parameters found within the GRIB. Likewise, the ODS LdmPoint2Tar client writes selected point data messages into hourly tar files by data type, and LdmNexrad2TarGZ streams Level-II WSR-88D data into gzipped tar files by volume scan time for each radar station.

On completion of a GRIB file, LdmGrib2Flat emits a uniquely identified data arrival notification message indicating the location on disk of the newly available file (Fig. 1). For each GRIB file that requires subsequent processing, the associated notification identifier is specified in a pqact configuration, along with the command to be run. Given a desired notification, pqact pipes the notification message to a script, qsubOnTextNty.pl, that extracts the GRIB file's path and then submits the specified command to the SGE job queue for execution. For instance, the ODS Grib2NetCDF application may be run to convert a GFS model GRIB file to netCDF format.

Figure 1. A sample ODS data arrival notification message.

Point data tar files can subsequently be read by the Point2NetCDF program to decode the raw data and create netCDF files, for example containing METAR or maritime observations. These jobs, typically run on a schedule, are submitted to the SGE queue by fcron using the script qsubCmd.pl.

The key extension to ODS for the cluster was to enable specification of separate LDM queues for data services and notifications. Thus, the ODS clients process data from an ingest queue and insert notification messages into a local notification queue.
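To make the shape of such triggering concrete, a pqact.conf rule along the following lines would match the LdmGrib2Flat_NOAAPORT notification and pipe it to qsubOnTextNty.pl together with the command to run, and an fcrontab line would route a scheduled job through qsubCmd.pl. The EXP feedtype, script paths, schedule, and job names shown here are illustrative assumptions rather than the operational configuration, and pqact.conf fields must be separated by tabs:

    # pqact.conf: hand a data-arrival notification to the SGE submission script
    EXP    ^LdmGrib2Flat_NOAAPORT
           PIPE    /usr/local/ods/bin/qsubOnTextNty.pl Grib2NetCDF

    # fcrontab: a scheduled acquisition job submitted to SGE via qsubCmd.pl
    15 * * * *    /usr/local/ods/bin/qsubCmd.pl ftpFetchPointData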

5.2 LDM Data Services

To independently handle various incoming and outgoing data streams, a number of LDM instances are configured under Cluster Suite to run as relocatable cluster services. These services are normally set to run on different hosts, but two or more can also run concurrently on a single host, if necessary in a failover situation. The specific hosts on which each service may run are determined within the cluster by defining Failover Domains, configured based on a desire to reasonably balance data volume across the cluster.

After considering the known data flows targeted for the cluster, we established the following five LDM services:

ldm-noaaport – ingest service for all NOAAPORT data

ldm-radar – ingest service for WSR-88D Level-II and WSI NOWrad radar data

ldm-other – ingest service for GOES GVAR and miscellaneous data types

ldm-outbound – distribution service for sending cluster data to other GSD hosts

ldm-fdr – Facility Data Repository service for saving data to GSD's Mass Store System.

A non-LDM sixth service, MainIP, establishes the cluster's canonical IP address and runs the single fcron daemon instance.

Each LDM service establishes a unique IP address in addition to its own queue, pid, and log files, enabling multiple rpc.ldmd services to coexist on a host. Thus, an external LDM may request data from the ldm-outbound service without knowing the particular host it happens to be on.

As an LDM ingest service receives data, its associated pqact and ODS client processes are responsible for writing the data to the shared storage. On successful completion of data files, the various data writing methods then emit notification messages to signal the events. And, rather than inserting the notifications back into their own service queues, the writers instead put the notification messages into a local "Notify LDM" queue.

5.3 The Notify LDM and SGE

The Notify LDM (ldm-notify) operates independently on all the cluster hosts to initiate processing on receipt of data arrival notification messages. These Notify LDM instances start up at host boot time using the local host IP address and identical ldmd.conf and pqact.conf configuration files on each host. While processing for many files simply entails copying from the shared storage to GSD's public file server, some files require additional format conversion, image remapping, or other post-processing. After each processing step, a new notification message is inserted into the Notify queue to provide a trigger for yet another step.

The key to the cluster's load-balancing architecture is this: rather than simply running jobs on the host on which a notification was emitted, jobs are submitted to SGE, which finds the least loaded host for the job. Since data files on the shared storage are visible to each host, and Notify LDM is configured identically on all hosts, processing is thus dynamically distributed across the cluster members. If a member leaves the cluster, say for OS patching, the remaining hosts pick up the work that would have been assigned by SGE to the missing host.
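As a concrete sketch of that hand-off, the submission script ultimately issues an ordinary SGE request, and the SGE scheduler then chooses the execution host based on load. The job name, output directory, and Grib2NetCDF arguments below are hypothetical; -b y simply tells qsub that the target is an executable rather than a job script:

    # submit the conversion as a binary job; SGE places it on the least loaded host
    qsub -b y -j y -N Grib2NetCDF \
         -o /data/logs/sge \
         /usr/local/ods/bin/Grib2NetCDF /data/grib/noaaport/gfs/2008011512_f012.grib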
5.4 LDM Scripting and Configuration

To manage the data service and Notify LDM instances in a shared environment, several accommodations were needed.

First, we established each instance's home path on shared storage as /usr/local/ldm/ldm-<instance>, where instance is noaaport, radar, other, outbound, fdr, or notify. Since the data services only run on one host at a time, the ldmd.pid file for each service instance lives in its instance home. Similarly, the ldmadmin-pl.conf, ldmd.conf, and pqact.conf files reside in ./etc directories under their respective home paths and can be managed from any cluster host; log files reside under ./log and are visible from any host as well. In contrast, to run independently on all hosts, the Notify LDMs require their ldmd.pid, pqact.conf, and ldmd.log files to reside locally on each host. Thus, the Notify ldmd.pid files are written to the local path /var/run/ldm, the Notify pqact.conf files reside under /var/ldm/etc to accommodate unique pqact 'state' files on each host, and ldmd.log files are written into local /var/ldm/log directories.

To control the LDM instances, we extended the standard ldmadmin perl script. The modified ldmadmin looks for a shell environment variable, LDMINSTANCE, that determines the path for the associated ldmadmin-pl.conf, which in turn sets the various instance-specific LDM parameters; without a valid LDMINSTANCE, ldmadmin exits with an error message. We further facilitated the use of instances by developing a new script, ldmadmin-wrapper, to which we linked ldmadmin-<instance> files. In addition to ensuring that the instance environment variable is correctly set, this cluster-aware wrapper also prevents inappropriate commands from being run. In particular, because the data service instances may only be started and stopped by the Cluster Resource Manager to ensure cluster consistency, ldmadmin-wrapper restricts which options may be run from the command line. Thus, 'ldmadmin-noaaport start' will exit with a message listing the valid options that may be run (e.g., pqactHUP); 'ldmadmin-noaaport watch' on the NOAAPORT ingest service host is valid and will display incoming message information, while on a different host it will simply indicate on which host that service is located and then exit.

We also needed to accommodate starting and stopping the LDM instances by automated methods. Specifically, the Cluster Resource Manager is responsible for data service activation, while the Notify LDM is activated at host boot time. To provide that function, we implemented an LDM resource script, /etc/init.d/ldm, again linked by instance versions (e.g., ldm-noaaport), to cleanly start and stop LDM instances with the correct environment. Whereas a typical LDM resource script will check for a viable LDM queue and repair it if necessary, the cluster's version extends that concept by mounting a new tmpfs partition at startup and creating a new in-memory queue for the instance; similarly, stopping a service will unmount the partition. Using tmpfs for the queue ensures the highest possible throughput, and guarantees that an LDM queue is not corrupted on failover or reboot.

Finally, another ldmadmin modification forces each LDM instance to log to a specific log file, thereby overriding LDM's default syslogd logging method. This strategy avoids the problems that would arise from trying to handle cluster-wide logging through the host-based syslogd facility. It also eliminates needing to build and maintain separate instances of LDM to support independent logging. Because LDM does not presently have a method for closing physical log files (short of restarting), we also extended LDM's newlog utility to "zero out" the currently open log file after rotating the previous logs down by one version number.
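For illustration only, and not the actual resource script, the queue handling on service start and stop might reduce to steps like these; the mount point, queue size, and instance name are arbitrary here, and pqcreate option syntax varies somewhat across LDM releases:

    # start: create this instance's product queue on a freshly mounted in-memory partition
    mount -t tmpfs -o size=2g tmpfs /usr/local/ldm/ldm-noaaport/queue
    pqcreate -c -q /usr/local/ldm/ldm-noaaport/queue/ldm.pq -s 1500000000   # ~1.5 GB queue

    # stop: the in-memory queue is discarded along with the partition
    umount /usr/local/ldm/ldm-noaaport/queue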
6. LOGICAL FLOW

While there are numerous scenarios for processing the many types of data handled by the cluster, the following steps describe a typical flow, as depicted in Figure 2:

• On system startup, ldm-notify is started on all cluster hosts.

• The ldm-noaaport service is started by Cluster Manager on dc01 and begins receiving NOAAPORT data from its upstream source; meanwhile, the ldm-outbound service starts on dc03 and accepts connection requests from external clients.

• The LdmGrib2Flat client on dc01 reads from the ldm-noaaport queue and writes GRIB data into shared directories under /data/grib/noaaport.

• On completion of a GRIB data file, LdmGrib2Flat inserts an LdmGrib2Flat_NOAAPORT notification into the local ldm-notify queue.

• pqact on dc01 identifies the notification message as one that requires a Grib2NetCDF action, and submits the configured job into the SGE queue.

• SGE finds the least loaded host in its list of available cluster hosts, and runs the Grib2NetCDF job on that host, say dc02.

• Grib2NetCDF on dc02 reads the specified GRIB file and creates a netCDF file in the configured directory under /tmp_data/grids. When finished, Grib2NetCDF inserts a notification message into the local ldm-notify queue on dc02.

• pqact on dc02 identifies the notification message as one that requires a moveTmpDataFiles action to copy the file from the /tmp_data/grids location to the desired public server location, and so submits the configured job to SGE.

• moveTmpDataFiles is started by SGE on the least-loaded node, and on completion inserts a new notification message in the ldm-notify queue on that node.

• If necessary, another job may then be activated on the cluster, for example to send the netCDF file into the ldm-outbound queue for delivery to an external client.

As the activity driven by the data services proceeds, many time-based ingest actions, particularly ftp fetch jobs, are initiated by the fcron service. Like pqact jobs, the fcron jobs are also submitted to SGE, and so distributed evenly across the cluster. In addition to acquisition tasks, fcron jobs also perform many housekeeping tasks such as file purging and monitoring. Job activity on the cluster is monitored by the SGE qstat utility (Fig. 3).

Figure 2. Cluster logical flow.

7. CONCLUDING REMARKS

The cluster system now operating in GSD's Central Facility establishes a reliable, capable, and scalable new architecture for meeting substantial and ever-growing data ingest and processing requirements. While the learning curve traversed to successfully implement this system was relatively steep and, at times, arduous, the resulting system has been well worth the effort.

Currently, nearly 200 pqact entries are configured to handle specific data arrival events, and fcron manages some 230 different time-based jobs. Driven by the data flow and the work to be performed, we are presently logging over 160,000 jobs submitted to the cluster's SGE queue each day. We expect these numbers to grow as new requirements are added by our GSD data consumers.

8. REFERENCES

Hamer, P. (2005), FSL Central Facility data systems concepts, presented at 21st Int. Conf. on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, American Meteorological Society, San Diego, CA.

Lipschutz, R. and C. MacDermaid (2005), Recent advances in the FSL Central Facility data systems, presented at 21st Int. Conf. on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, American Meteorological Society, San Diego, CA.

Figure 3. SGE ‘qstat’ output showing jobs distributed across the six cluster hosts based on system load. Job names (column 3) are descriptive of the work being done.