
Cyberinfrastructure Usage Modalities on the TeraGrid

Daniel S. Katz, Computational Institute, University of Chicago & Argonne National Laboratory, Chicago, USA, [email protected]
David Hart, Computational & Information Systems Laboratory, National Center for Atmospheric Research, Boulder, USA, [email protected]

Chris Jordan, Texas Advanced Computing Center, University of Texas at Austin, Austin, USA, [email protected]
Amit Majumdar, San Diego Supercomputer Center, University of California San Diego, San Diego, USA, [email protected]

J.P. Navarro, Computational Institute, Argonne National Laboratory & University of Chicago, Chicago, USA, [email protected]
Warren Smith, Texas Advanced Computing Center, University of Texas at Austin, Austin, USA, [email protected]

John Towns, National Center for Supercomputing Applications, University of Illinois, Urbana, USA, [email protected]
Von Welch, Indiana University, Bloomington, USA, [email protected]

Nancy Wilkins-Diehr, San Diego Supercomputer Center, University of California San Diego, San Diego, USA, [email protected]

Abstract—This paper is intended to explain how the TeraGrid would like to be able to measure "usage modalities." We would like to (and are beginning to) measure these modalities to understand what objectives our users are pursuing, how they go about achieving them, and why, so that we can make changes in the TeraGrid to better support them.

Keywords—production grid infrastructure

I. INTRODUCTION

The TeraGrid represents one of an emerging class of entities that can be referred to as "production cyberinfrastructures." These cyberinfrastructures, which also include the Open Science Grid in the US, DEISA and EGI in Europe, and RENKEI in Japan, harness distributed resources, including high-performance computing (HPC), storage, and specialized computing systems. These combinations of integrated resources make possible a variety of user and usage scenarios. To better support our user community, TeraGrid wants to understand how users are actually using these resources to achieve their scientific objectives, so that we can make changes in the TeraGrid environment to improve operations and services. Because TeraGrid will be transitioning into a new project (eXtreme Digital, or XD) in mid-2011, much of this data gathering will be given to the new project and can be used in planning its growth.

A great deal of attention has gone into the study of HPC workloads, with the goals of improving scheduler performance and maximizing resource utilization. A study of the TeraGrid's workload patterns shows that, in many ways, the TeraGrid's HPC resources demonstrate many of these same patterns, both on individual systems and across the federation [1].

However, while standard workload analyses can describe what users are doing on TeraGrid's HPC systems, they cannot easily be used to understand why users do what they do and how they leverage multiple types and instances of cyberinfrastructure (CI) resources. We have labeled these generalized, CI-spanning classes of user activity "usage modalities" and have begun efforts to instrument TeraGrid infrastructure services so that we can conduct quantitative analysis of user behavior in terms of these modalities.

The paper is organized as follows. We first define the modalities we think are important to measure. We then discuss what is possible to measure in general, and on the TeraGrid. We next explain the characteristics of the modalities we have selected (user intent, when-to-run, submission mechanism, resources, job coupling, support, and level of software development) and present some early measurements we have taken, from allocation questions asked during June 2010 through January 2011 and from jobs run during Q4 (October to December) 2010. We finally discuss what else we would like to do, both internally on TeraGrid and with other infrastructures, and then conclude.

II. DEFINING MODALITIES

Our first challenge was defining what modalities we needed to measure. Initially we attempted to enumerate all possible modalities, but it quickly became apparent that this approach was insufficient, as it produced a lengthy list that was hard to group into useful elements. Rather, we concluded that we needed to define a matrix of modalities, with each modality defined by a set of possible values for a number of common usage attributes. These attributes are not necessarily continuous dimensions, so a user's modality is more correctly defined as an attribute-value "tuple" within a tuple-space that describes a usage behavior: how a user is using TeraGrid at a particular time. Thus, these modalities suggest the session classes that can be inferred from HPC job logs [2], but represent a more complex construct, potentially spanning many types of CI resources.

The attribute-value approach to defining a modality offers several advantages. First, in addition to a fully defined matrix, it also allows us to express modalities at different levels of granularity. We can express, for example, the set of modalities defined by only two characteristics while ignoring other attributes that are not important to the analysis at hand. We can then refine the analysis by incorporating additional attributes as needed.

Second, the challenge of measurement becomes much more tractable. To measure a modality, TeraGrid needs only to be able to measure the set of characteristics that express that modality. The converse is also true; each of the modality characteristics must be defined as measurable properties for which instrumentation could, in theory, be implemented. Characteristics that cannot be measured are not useful discriminators of different modalities. It also makes it possible to measure some modalities without having to be able to measure all modalities. That is, the characteristics we can measure define some parts of the overall modality tuple-space, which we can usefully study while designing the instrumentation for the characteristics we cannot yet measure.

III. GENERAL INSTRUMENTATION AND MEASUREMENT REQUIREMENTS

The first implicit requirement in the characteristic-based definition of modalities is the ability to measure each characteristic in terms of a common set of user identities. That is, it must be possible to know that a given user in the measurement for one characteristic is the same as a given user in the measurement of a second characteristic. Without this identity linkage, we are left only with independent measures of each characteristic. We would be unable to make assertions about which modalities, as defined by the intersection of those two characteristics, are most prevalent.

Second, these characteristics must apply to a common unit of usage, which currently in the HPC-oriented world of TeraGrid is a batch-queue job. A job is most often measured in two dimensions: time (duration) and space (size, cores, etc.). Most commonly, a job is a single run of an application, but a job may also be part of a larger ensemble of jobs, or it could be part of a workflow consisting of a set of sequentially dependent jobs, or a job could be one of a set of jobs that together comprise an MPI application that is run across two distinct clusters. In some cases, a job may also be an interactive session, for example, on a visualization system. To accurately understand usage modalities, we must be able to implicitly infer or explicitly annotate these sorts of relationships between jobs.

Thus, understanding usage modalities requires tracking usage by user, by job, and by job usage (time x space). Most HPC jobs are measured in terms of core-hours (number of hours x number of cores). TeraGrid uses the concept of "normalized units" (NUs) as a unit of compute usage that permits usage comparisons across heterogeneous systems. One NU is equivalent to a Cray X-MP core-hour, and the conversion from a system's local core-hours to NUs is calculated based on its HPL benchmark performance. As an example, one core-hour on Ranger, the Sun Constellation Cluster at TACC, equals about 35 NUs, while one core-hour on a modern desktop PC equals about 45 NUs.

Storage consumption is measured by the terabyte-year (TB-yr) or gigabyte-year (GB-yr), intended to encapsulate both the amount of space consumed and the time that data spends resident on storage devices. For purposes of allocating storage, we only recognize usage in units of a year; this presents challenges in measuring user modalities involving scratch or temporary storage. Therefore, there are two metrics of interest used to measure usage in this context: a simple absolute value of storage used, which is used to measure data moved or generated in association with a computation or a visualization/analysis task, and which may be assumed to be temporary; and a terabyte-year metric, which is used to measure persistent data stored by TeraGrid users. These measures are applied to whole resources based on the usage model defined for those resources. For example, because an archive storage resource is not used for temporary storage, its usage will always be measured in terabyte-years.
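As a concrete illustration of these two units, the following Python sketch converts local core-hours to NUs and computes a terabyte-year figure. The conversion factors, function names, and example values are illustrative only, taken from the approximate figures quoted above; they are not TeraGrid's actual accounting code or conversion tables.

# Illustrative sketch of the two usage units described above; the NU
# factors are the approximate values quoted in the text, not the official
# TeraGrid tables (which are derived from HPL benchmark results).
NU_PER_CORE_HOUR = {
    "ranger": 35.0,       # Sun Constellation Cluster at TACC (approximate)
    "desktop_pc": 45.0,   # a modern desktop PC (approximate)
}

def core_hours_to_nus(system, cores, wall_hours):
    """Convert a job's local core-hours to normalized units (NUs)."""
    return cores * wall_hours * NU_PER_CORE_HOUR[system]

def terabyte_years(size_tb, resident_days):
    """Persistent storage usage: space consumed times time resident."""
    return size_tb * (resident_days / 365.0)

# A 256-core, 12-hour job on Ranger: about 107,520 NUs.
print(core_hours_to_nus("ranger", 256, 12.0))
# 10 TB kept in an archive for one year: 10 TB-years.
print(terabyte_years(10.0, 365.0))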

Analyzing usage modalities that span resource types requires conversion factors between the local units of each type. While several individual resource providers within the TeraGrid have their own conversions between storage and compute metrics (for example, NCAR's "generalized accounting unit," or GAU, formulas allow conversions between HPC core-hours and archival system storage units), there is no general TeraGrid conversion factor, and we do not assert any specific equivalence here.

Another resource that is important to TeraGrid is network bandwidth. However, as we currently have no means to tie network usage to specific users or usages, we reluctantly do not have any modalities that involve network usage. We recognize this as a shortcoming and would like to change it in the future.

IV. TERAGRID MODALITY CHARACTERISTICS

Once we recognized the need for multiple modality characteristics, our initial enumeration led us to identify seven attributes of usage modalities: user intent, when-to-run, submission mechanism, resources, job coupling, support, and level of software development. It is likely that other infrastructures may have slightly different characteristics and possible values for defining modalities that would be applicable, but we think this is a good starting point.

We also want to state that we are aware we may not be able to measure all of these characteristics with complete accuracy; our goal is just to be representative.

A. User Intent

TeraGrid has identified "user intent" as an important modality characteristic with three possible values: production, exploration/porting, and education. In the TeraGrid, user intent information is captured as part of the allocations process. Each project requests an allocation and provides certain information about what it plans to do as part of that request. This information is captured in the POPS allocation request system.

Production means actually doing computational research, and it has several flavors. Some users do production for a short burst and then stop. Other users have been running large-scale production runs for years. These variants may not matter, except in terms of the type of user support needed. Exploration/porting covers the cases where a new system, application, or problem is being examined and the phase where a user is porting or scaling their code to use the system in question. Education as a user-intent value means that the resource is being used for classroom instruction or training, perhaps with a number of students learning how to use a system or application.

There are currently three types of allocations that a TeraGrid user can request: Research, Startup, and Education. It's likely that Education allocations correspond to education as a user intent, and hence, jobs that are run under an Education allocation have the education user-intent modality. However, the correspondence between the other two types of allocation, Research and Startup, and the other two user-intent modalities, production and exploration/porting, is not as clear. In general, jobs run within a Research allocation can be almost completely mapped to the production modality, but jobs run within a Startup allocation currently fall into two categories: small-scale production (which should be mapped to the production modality) and exploration/porting (which should be mapped to the exploration/porting modality).

We have begun asking users the following question as part of their allocation request:

Please estimate what percentage of the work you expect to do in this allocation will be the following types (the 3 numbers should sum to 100):
__ Production (actually doing research)
__ Exploration/porting (preparing to do research)
__ Education (teaching others to do research)

Once we have this information, we can then map the jobs run under this allocation with these weights. For example, to measure this modality for a given user, we just use the numbers the user provided: a user who answers 90% production and 10% exploration/porting would count as 0.9 users doing production and 0.1 doing exploration/porting. Similarly, if this user runs 10 jobs, we would count this as 9 production jobs and 1 exploration/porting job. And if these 10 jobs use 100,000 NUs, we would say that 90,000 NUs had been used for production and 10,000 for exploration/porting.
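This apportionment is simple bookkeeping; the following Python sketch (with hypothetical field names, not the actual POPS or TGCDB schema) shows the calculation for the example above. The job-coupling question described later in this section is apportioned the same way.

# Minimal sketch of apportioning a user's jobs and NUs across the
# self-reported intent categories; names and structure are hypothetical.
def apportion_usage(weights, jobs, nus):
    """Split user, job, and NU counts by the percentages reported
    in the allocation request (weights should sum to 100)."""
    total = float(sum(weights.values()))
    return {
        intent: {"users": pct / total,
                 "jobs": jobs * pct / total,
                 "nus": nus * pct / total}
        for intent, pct in weights.items()
    }

# The example from the text: 90% production, 10% exploration/porting,
# 10 jobs, 100,000 NUs -> 0.9/0.1 users, 9/1 jobs, 90,000/10,000 NUs.
print(apportion_usage({"production": 90, "exploration/porting": 10},
                      jobs=10, nus=100000))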
Our initial data for this question spans 547 Startup and 231 Research allocations submitted from June 2010 to January 2011 (Figure 1). As expected, about 90% of the Research allocation work was classified as "Production." Within the Startup allocations, though, only 30% of the work was self-reported as "Exploration"; 60% of the work was classified as "Production" by the request submitters.

Figure 1. Self-reported user intent, by allocation type, on TeraGrid.

B. When-to-run

We identified four values for the "when-to-run" characteristic: batch, interactive, high-priority, and reservation. Batch means that the job is not time-sensitive or time-critical. Interactive means that the job should run when the user is available. High-priority means that the job should run soon: possibly immediately (the job should run now, and may need to preempt other running jobs) or possibly next-to-run (other running jobs shouldn't be preempted, but the job should run as soon as otherwise possible). Finally, reservation means that the job should run at a time the user specifies (e.g., it is co-scheduled with jobs running on other resources or with an instrument reservation, or it runs at a time when an audience will be available to watch the job). Note that in many cases, all of these jobs are run through the batch scheduler; the term batch in this modality is not the same as simply using the batch scheduler, but rather means using the batch scheduler in a "normal" mode.

On TeraGrid today, most jobs are run in the batch modality, and this is the default category for counting jobs; i.e., we assume the total number of jobs that have been run are in the batch category, then subtract off the others as we count them. We don't currently have an implementation of the interactive modality, so there are no jobs to count for it. For the high-priority modality, we have two implementations: SPRUCE [3] and site-specific policies. SPRUCE is a client/server software tool that permits urgent jobs; it could count the jobs that are run on TeraGrid under both urgent modalities. Alternatively, the sites that run urgent jobs could add a flag to the accounting records when they run such jobs. Reservations are similar to urgent jobs in terms of tracking, since there are two TeraGrid tools that are used for reservations, plus additional site-specific means. Again, either the tools or the sites could count and flag reserved jobs.

C. Submission-mechanism

There are four submission-mechanism modalities: command line, grid tools, science gateway, and metascheduling. Command line can also be thought of as including the idea of "local," as this is meant to indicate that a user has logged into a system and used a local interface, such as the command line, to submit a job. Grid tools means that the job has been submitted through grid tools, such as Globus GRAM [4], gLite CREAM [5], and UNICORE JMS [6] (OGSA-BES [7] is a service that sits on top of these tools), usually from a system that is different from the system on which the job will be run. Science gateway is meant to indicate that the user does not submit a job directly, but that the job is submitted based on a request from the user, possibly through a portal. Metascheduling can underlie any of the other submission-mechanism modalities; it indicates that the user does not pick a specific resource on which to run the job, but rather a set of resources is specified, and some additional logic decides on which of those resources to run the job.

TeraGrid uses Globus GRAM as its primary grid submission mechanism. With a Globus "listener" capability [8], TeraGrid is able to record the number of jobs submitted to each resource via GRAM. However, the linkage between GRAM jobs and accounting records of jobs has not yet been fully made. Until recently, the local resource provider accounting systems did not record how a job was submitted. Now, four sites have deployed "GRAM audit tools" that annotate batch job accounting records with any GRAM-related submission data. Initially designed to record end-user data from gateways, the tools also allow a site to annotate any GRAM-submitted job, prior to sending that job to the TeraGrid Central Database (TGCDB) to charge the relevant project. With GRAM audit tools and other information, we can more accurately extract the number of jobs submitted via gateways. We do not currently have instrumentation that can annotate jobs associated with metaschedulers, though the process would be essentially the same as with the GRAM audit tools, except that data from metascheduler logs would have to be retained throughout the job accounting process.

Because of the difficulty of instrumenting these submission mechanisms, we also asked allocation requestors to estimate how they intended to submit their jobs:

Please estimate what percentage of the jobs you expect to run in this allocation will be the following types (the 3 numbers should sum to 100):
__ Submitted through command line/script
__ Submitted using Grid tools (such as GRAM)
__ Submitted through a metascheduler (to run on one of a set of resources, without user control over which of the set is chosen)

Our initial data from June 2010 to January 2011, shown in Figure 2, indicate that Research allocation users are the most conservative, with 93% of the jobs expected to be submitted via the command line. Startup and Educational users were slightly more likely to use grid tools (10% for Educational and 9% for Startup allocations) and a metascheduler (10% for Startup and 5.5% for Educational allocations). Examination of the individual responses shows that the planned use of grid tools and metaschedulers was not evenly distributed, but was concentrated in a small subset of the projects. Only about 30% of the Startup requests planned to make any use of grid tools or metaschedulers.

Figure 2. Self-reported planned job submission mechanism, by allocation type, on TeraGrid.

Using our early data, we have performed a coarse comparison of the self-reported intent to use grid tools with the actual GRAM jobs recorded by the TeraGrid Globus listener. In Q4 2010, TeraGrid's Globus listener recorded 41,363 GRAM jobs across TeraGrid resources, while the TeraGrid accounting system recorded 993,404 job records, with 85% associated with Research allocations, 11% with Startup allocations, and the remainder with Educational and other projects (such as staff projects). Using the average expected percentage of grid jobs in each category, we would predict that just over 44,000 jobs would have been submitted via GRAM in the quarter. Further analysis is required to understand whether the actual and predicted numbers of grid-submitted jobs are within 6% of one another by coincidence or by correlation.
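The form of this prediction is shown in the Python sketch below; the per-category grid-tool fractions used here are illustrative placeholders chosen for the example, not the exact survey averages used to produce the 44,000 figure.

# Sketch of the prediction described above: expected grid-submitted jobs =
# sum over allocation categories of (job records in the category) x
# (self-reported fraction of jobs to be submitted via grid tools).
# The fractions below are assumed values for illustration only.
TOTAL_JOB_RECORDS = 993404            # Q4 2010 accounting records

category_share = {"research": 0.85, "startup": 0.11, "other": 0.04}
grid_fraction = {"research": 0.04, "startup": 0.09, "other": 0.02}   # assumed

predicted = sum(TOTAL_JOB_RECORDS * category_share[c] * grid_fraction[c]
                for c in category_share)
actual = 41363                        # GRAM jobs seen by the Globus listener

print("predicted %.0f grid jobs vs. %d observed" % (predicted, actual))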
D. Resources

We have identified eight values that characterize the resources used on an infrastructure by a job: one HPC resource, one HTC resource, multiple HPC resources, visualization resource, data-intensive resource, archival storage resources (amount), multi-site storage resources (amount), and non-TG resources. In general, this is the characteristic we can most accurately measure today, with the exception of "non-TeraGrid resources," by which we would like to understand a user's or project's usage of TeraGrid resources in concert with other resources outside of TeraGrid.

Because the current TeraGrid accounting system includes the resource as part of every job record, we can determine whether one or more resources have been used in any project or by any user. We can also identify usage modalities by resource type (e.g., HPC, HTC, storage). Non-TeraGrid resource information would have to be captured via surveys or as part of allocation requests, as with user intent information. Figure 3 shows how many of the 959 TeraGrid projects that were active in Q4 2010 used resources of different types. Of the three projects using 6 or 9 different HPC resources, two are TeraGrid staff projects. The consumption of SUs according to resource type is similarly skewed toward projects using a small number of HPC resources. The same data set also allows us to show that most of the active projects (86%) used only one resource type during the quarter, but 13% of projects did make use of two resource types.

Figure 3. TeraGrid projects making use of different numbers and types of resources in Q4 2010.

Figure 4 shows how many of the 1,904 individual TeraGrid users that were active in Q4 2010 used resources of different types. Again, the use of one HPC resource dominates the pattern. The users associated with seven and eight HPC resources are a science gateway community user account (nanoHUB [9]) and TeraGrid's Inca [10] monitoring user account, respectively. Compared to projects, a greater percentage of users (92%) used only one resource type; 8% of users used two resource types.

Figure 4. TeraGrid users making use of different numbers and types of resources in Q4 2010.

In capturing usage modalities related to data, there are a few basic patterns of usage we would like to track, which imply some specific measurements corresponding to jobs or processes launched by users. We wish to measure the persistent use of archive and other storage resources over the course of allocations; we wish to measure the use of multi-site storage resources, such as GPFS- and Lustre-WAN file systems, within computation and visualization tasks that may span multiple sites; and we wish to measure the extent of data movement in preparation for and in cleanup after a task is run on the TeraGrid. Each of these presents specific information-gathering challenges.

Technically, the simplest measurement to make is of persistent use of a storage resource. Most archives and file systems store metadata describing the size, ownership, and group ownership of all data objects; thus it is possible on all major storage systems to generate summaries of numbers of files, sizes of files, and total usage on a per-user or per-group basis. However, depending on the technical implementation, generating such reports can be extremely time- and resource-intensive, meaning that the frequency of measurement must be carefully chosen to avoid disruption to normal operations. Currently, total usage information over all TeraGrid users is gathered for persistent TeraGrid storage resources on a monthly basis, but more detailed reporting is possible.
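For a single POSIX file system, such a per-user summary can be produced from the metadata alone, as in the Python sketch below. The path is a placeholder, and on a large archive or parallel file system the full traversal shown here would be replaced by the system's own metadata query tools, for the performance reasons just noted.

# Minimal sketch of a per-user storage summary built from file metadata
# (size and ownership). A full walk like this can be very slow on large
# systems, which is why measurement frequency must be chosen carefully.
import os
import pwd
from collections import defaultdict

def summarize_usage(root):
    usage = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue                      # file vanished or unreadable
            try:
                owner = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                owner = str(st.st_uid)        # uid with no passwd entry
            usage[owner]["files"] += 1
            usage[owner]["bytes"] += st.st_size
    return usage

# "/archive" is a placeholder mount point, not a real TeraGrid path.
for user, totals in sorted(summarize_usage("/archive").items()):
    print(user, totals["files"], "files,", totals["bytes"] / 1e12, "TB")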

Measurement of multi-site storage usage is similarly simple at a technical level, but the availability of a GPFS-WAN or Lustre-WAN file system on multiple resources leads to complexities of measurement, since what we really want is not just a total usage number, but an association of data objects or data sets with one or more resources that generate or consume this data. This will provide us with a more complete picture of both the transience of computing on TeraGrid resources and the extent to which multi-site storage resources encourage this transience. To fully generate this information, instrumentation is needed on the compute nodes of TeraGrid resources. With appropriate information, statistics could be generated either in real time or at the end of jobs, showing the total data flows into and out of each node. Again, however, extensive instrumentation of this type could have a performance impact and must be carefully designed to balance the desire to gather complete data about the use of storage resources with the need to minimize the impact on users.

The measurement of data transfer modalities in the TeraGrid is greatly assisted by the existence of a Globus GridFTP listener, similar to the Globus GRAM listener described previously, that generates centrally collected reports of all transfers initiated on TeraGrid resources. Since GridFTP accounts for the majority of data transfer traffic within the TeraGrid, these listener statistics are an excellent way to generate high-level summaries of data flows through the TeraGrid. At present, however, these statistics do not include per-user summaries, nor are they in any way correlated with projects or with specific resources; i.e., if multiple storage resources are available at a given network endpoint, the listener only gathers information about the endpoint, not the individual storage systems. To truly provide a characterization of usage modalities in data transfer, this more fine-grained information is needed.

E. Job coupling

There are four modalities for job coupling: independent, independent but related, interdependent, and dependent. Independent jobs are those that are not immediately connected to any other job. Independent but related jobs include those jobs that make up an ensemble or parameter sweep, for example. Interdependent jobs involve multiple jobs that must run simultaneously and communicate often. Dependent jobs are elements of a set of multiple jobs, such as in a workflow, where the completion of one or more jobs enables one or more other jobs in the set to run. Note that a job that is broken into chunks by queue limits might be considered a set of dependent jobs under these definitions.

The only way TeraGrid can currently categorize jobs is by asking the user. Ideally, if all users used a fixed set of tools, we could instrument the tools to gather this information, but this is not feasible in practice; while many users will use a standard set of tools, others will write their own scripts or create other mechanisms for controlling their jobs.

We have begun asking users the following question as part of their allocation request:

Please estimate what percentage of the science runs you expect to perform in this allocation will be the following types (the 4 numbers should sum to 100):
__ Independent (a job that is not immediately connected to any other job; a job that is artificially broken into chunks by queue limits should still be placed in this category)
__ Independent but related (such as jobs that make up an ensemble or parameter sweeps)
__ Tightly coupled (multiple jobs that will run simultaneously)
__ Dependent (multiple jobs such as in a workflow)

Once we have this information, we can then map the jobs run under this allocation with these weights. For example, to measure this modality for a given user, we just use the numbers the user provided: a user who answers 70% independent but related and 30% dependent would count as 0.7 users doing independent but related work and 0.3 doing dependent work. Similarly, if this user runs 10 jobs, we would count this as 7 independent but related jobs and 3 dependent jobs.

Our initial data from allocation requests during June 2010 to January 2011, shown in Figure 5, indicate that submitters expect most jobs to be Independent (54%) or Independent/Related (28%), with about 10% categorized as Tightly Coupled and 7% as Dependent. The values for both Startup and Research requests were aligned with the overall averages. Among Educational requests, a slightly larger percentage of jobs (68%) were classified as Independent, which is consistent with jobs submitted as part of small-scale class assignments.

Figure 5. Self-reported job coupling breakdown, by allocation type, on TeraGrid.

F. Support

In TeraGrid, there is currently one attribute value for the support characteristic, ASTA support. ASTA (Advanced Support for TeraGrid Applications) is the advanced support mechanism where users request advanced support as a part of their allocation request for resources. The allocation review committee then reviews this request, and a recommendation score is provided to the Advanced User Support (AUS) area director. Based on this score, further discussions with the requesters, and other factors (such as the availability of staff with the necessary expertise and the resource provider site(s) where the user requested an allocation), AUS staff are matched with the allocation request to provide long-term collaboration and support for the group of users who made the allocation request. An ASTA may last up to the lifetime of the CPU allocation, up to one year.

An additional potential value for this characteristic is campus champion support. As of January 2011, there were 82 universities with one or more campus champions.

Campus champions are campus representatives who are a local source of knowledge about TeraGrid computing opportunities and resources. Campus champions have TeraGrid allocations that they give out to users on their campuses, and these allocations are flagged in the TGCDB. This means that we can say that user X ran a job and it was charged to allocation Y, which is a campus champion allocation. However, it is also possible that a campus champion could be helping a user with the user's own allocation, in which case there isn't any way of knowing that a campus champion supported the job. For this reason, we don't plan to track the campus champion value at this point.

In the current TeraGrid, the POPS system records allocation requests and submissions, and supports the review process. Once allocations are approved, the award information is communicated to the TGCDB, which supports the accounting of usage against those allocations. There are two databases associated with the allocations process. Currently, POPS is able to record ASTA request information and reviewer ratings of ASTA requests. However, there has been no automated record in the TGCDB of the ASTA requests that are initiated. The AUS area director manually keeps his own records of which ASTA requests were initiated, and their start and end dates. These are reported in TeraGrid quarterly and annual reports. While sufficient for managing staffing for ASTA requests, this disconnect has prevented us from understanding resource usage as it relates to supporting ASTA requests.

We have implemented a process, using existing mechanisms, such that the AUS area director can record an "allocation" for each of the projects that have ASTA support. This allocation will be the estimated number of months for the support. By leveraging existing mechanisms, the PI of the project will also get a notification of this ASTA support allocation, similar to how the PI would receive information about other resource requests that have been approved. The benefit of this is that we will be able to track (and query) which PIs received ASTA support and for how many estimated months, and we will then be able to match jobs that are run to ASTA support in place while the jobs were run.

For example, using records of ASTA allocations, it is possible to query the accounting database to ask questions such as, "In fiscal 2009, how many core-hours were charged by projects during allocation periods in which they had ASTA allocations?" The resulting data show that 27 ASTA-supported projects consumed 112 million core-hours, or about 10% of the core-hours recorded by the accounting system (Figure 6).

Figure 6. Service Units (SUs) consumed by projects during the period of their TeraGrid ASTA activity.
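A query of this form might look like the following Python sketch; the table and column names are hypothetical (the real TGCDB schema differs), and sqlite3 simply stands in for the production database driver.

# Sketch of the kind of accounting-database query described above.
# Table and column names are hypothetical; the real TGCDB schema differs.
import sqlite3   # stand-in for the production database driver

QUERY = """
SELECT COUNT(DISTINCT j.project_id) AS asta_projects,
       SUM(j.core_hours)            AS asta_core_hours
FROM jobs j
JOIN asta_allocations a
  ON a.project_id = j.project_id
 AND j.end_time BETWEEN a.start_date AND a.end_date
WHERE j.end_time BETWEEN '2008-10-01' AND '2009-09-30';  -- fiscal year 2009
"""

with sqlite3.connect("accounting.db") as db:    # placeholder database file
    projects, core_hours = db.execute(QUERY).fetchone()
    print(projects, "ASTA-supported projects used", core_hours, "core-hours")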
G. Level of software development

While the level of software development for a given user in a given situation can range from writing completely new software, to modifying an existing code, to developing a script to drive a code or developing a code based on a library, to executing a preexisting application, here we simplify and allow only two modalities: custom and commodity. Custom applications range from those that are developed from scratch by the user to those created by modifying an already existing application. Commodity applications are those that are executed without being modified by the user, such as applications that are pre-installed on a system by an administrator, or that are part of a science gateway, installed by a developer.

We can likely track the use of installed applications vs. the use of applications in the user's home directory or scratch space; of course, this is very inaccurate. Another option is to ask users their intent as part of the allocation request process, as we are planning to do for other modalities.

V. FUTURE WORK

There are a number of activities that would be interesting to undertake, and which may be undertaken by the eXtreme Digital (XD) project that will follow TeraGrid. These include measuring network usage and correlating it to specific users/jobs; looking in more detail at how specific systems are used; and measuring campus champion support as tied to users/jobs. The first and last of these activities have been discussed earlier in this paper. The middle idea has not been discussed, because it is not completely clear that it is related to TeraGrid, or that it should be measured by TeraGrid; however, it is clearly an issue that is important to TeraGrid users, so we briefly discuss it here.

This middle idea involves understanding the usage of specific systems (e.g., an HPC platform). For example, we might explore how our jobs depend on the bandwidth and latency of the interconnect, the memory, or the I/O system. Some of this could be done through tools such as NERSC's IPM [11]. The answers would be different for parallel/MPI/threaded jobs vs. single-process/thread jobs.

A fourth activity would be to compare our results with those of other infrastructures (international, national, campus, etc.). There are three potential types of knowledge we could obtain by doing this. First, other infrastructures might propose other important questions to ask, or might have other tools for gathering data that would help us better understand our own usage. Second, comparing our users with the users of other infrastructures will also likely help us understand the user community and our place in the provider community.

Finally, sharing this knowledge of our users with other infrastructures might allow various infrastructure providers to determine common approaches to solve common problems, or possibly to actually work together to solve them.

VI. CONCLUSIONS

There are a lot of things the TeraGrid would like to understand better about what our users are doing in order for the project to better support these users. In this document, we have presented a first set of characteristics (usage modalities) that we believe will allow us to gain this understanding, and we have started to make the process changes needed to gather data that can be used to generate this knowledge, by asking users new questions when they apply for allocations, and by modifying our advanced support policies and procedures so that we can add more information to our central database about users and their usage modalities.

ACKNOWLEDGMENT

This work was supported by the TeraGrid, under multiple NSF awards, including OCI-0503697 and OCI-0932251.

REFERENCES

[1] D. Hart, "Measuring TeraGrid: Workload Characterization of an HPC Federation," International Journal of High Performance Computing Applications, 2011. DOI: 10.1177/1094342010394382
[2] J. Zilber, O. Amit, and D. Talby, "What is Worth Learning from Parallel Workloads? A User and Session Based Analysis," Proceedings of the 19th Annual International Conference on Supercomputing (Cambridge, Massachusetts, June 20-22, 2005), ACM, New York, NY, pp. 377-386.
[3] P. Beckman, S. Nadella, N. Trebon, and I. Beschastnikh, "SPRUCE: A System for Supporting Urgent High-Performance Computing," in Grid-Based Problem Solving Environments, IFIP International Federation for Information Processing, v. 239 (P. Gaffney and J. Pool, Eds.), pp. 295-311, 2007. DOI: 10.1007/978-0-387-73659-4_16
[4] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke, "A resource management architecture for metacomputing systems," in D. Feitelson and L. Rudolph (Eds.), Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science 1459, Springer, 1998, pp. 62-82. DOI: 10.1007/BFb0053981
[5] C. Aiftimiei, P. Andreetto, S. Bertocco, S. Dalla Fina, A. Dorigo, E. Frizziero, A. Gianelle, M. Marzolla, M. Mazzucato, M. Sgaravatto, S. Traldi, and L. Zangrando, "Design and implementation of the gLite CREAM job management service," Future Generation Computer Systems 26(4), 2010, pp. 654-667. DOI: 10.1016/j.future.2009.12.006
[6] M. S. Memon, A. S. Memon, M. Riedel, B. Schuller, S. van de Berghe, D. Snelling, V. Li, M. Marzolla, and P. Andreetto, "Enhanced resource management capabilities using standardized job management and data access interfaces within UNICORE Grids," 2007 International Conference on Parallel and Distributed Systems 2, pp. 1-6, 2007. DOI: 10.1109/ICPADS.2007.4447834
[7] A. Grimshaw, M. Morgan, D. Merrill, H. Kishimoto, A. Savva, D. Snelling, C. Smith, and D. Berry, "An Open Grid Services Architecture Primer," Computer 42(2), pp. 27-34, 2009. DOI: 10.1109/MC.2009.35
[8] Globus Project, "Metrics, a Globus protoproject." http://incubator.globus.org/metrics/
[9] G. Klimeck, M. McLennan, S. P. Brophy, G. B. Adams III, and M. S. Lundstrom, "nanoHUB.org: Advancing Education and Research in Nanotechnology," Computing in Science and Engineering 10(5), pp. 17-23, September/October 2008. DOI: 10.1109/MCSE.2008.120
[10] S. Smallen, K. Ericson, J. Hayes, and C. Olschanowsky, "User-level Grid Monitoring with Inca 2," Proceedings of the 2007 Workshop on Grid Monitoring, 2007. DOI: 10.1145/1272680.1272687
[11] J. Borrill, J. Carter, L. Oliker, D. Skinner, and R. Biswas, "Integrated Performance Monitoring of a Cosmology Application on Leading HEC Platforms," International Conference on Parallel Processing, pp. 119-128, 2005.
