LHCb-CONF-2009-043
20/10/2009

SOFTWARE MANAGEMENT IN THE LHCb ONLINE SYSTEM

N. Neufeld*, E. Bonaccorsi, L. Brarda, J. Closier, G. Moine, CERN, Geneva, Switzerland
H. Degaudenzi, EPFL, Lausanne, Switzerland

* [email protected]

Abstract

LHCb has a large online IT infrastructure with thousands of servers and embedded systems, databases, network switches and routers, and storage appliances. These systems run a large number of different applications on various operating systems; the dominant operating systems are Linux and MS-Windows. This large, heterogeneous environment, operated by a small number of administrators, requires that new software or updates can be pushed reliably and automatically, as quickly as possible. We present here the general design of LHCb's software management along with the main tools, LinuxFC / Quattor and Microsoft SMS, describe how they have been adapted and integrated, and discuss experiences and problems.

INTRODUCTION

LHCb [1] is one of the four large experiments at the Large Hadron Collider at CERN. Like all these experiments it has a large IT infrastructure at the core of the trigger and control system. In the case of LHCb it consists of more than 1000 servers, desktops and embedded computers running applications ranging from simple power-supply control to complex SCADA systems, which assure the overall control of the entire facility.

The wide variety of tasks brings with it quite a variety of software and hardware, which needs to be managed and operated by a small team of system administrators. The hardware management is described elsewhere [2], but the large variety of the hardware increases the complexity of the software management as well.

REQUIREMENTS

The following requirements must be met by a useful software management system for LHCb. In the context of this paper the term user refers to both a human operator or developer and a software application or framework using a specific piece of software.

∙ Multi-platform: Linux and MS-Windows must be supported, in several versions and in 32-bit as well as 64-bit.

∙ Versioning: it must be easy to change back and forth between different versions of a piece of software.

∙ Scalability: software management is quite simple when every user gets the software from a single location. For a large system, however, a more scalable solution is needed.

∙ Speed: software distribution should be very fast, lest there be a strong temptation to bypass it, as urgent changes are frequent in the commissioning phase of an experiment.

∙ Configuration management capabilities: software management is more than simply managing versions of specific packages. In particular, operating system (OS) software needs configuration, which can vary from individual machine to individual machine.

∙ Compatibility: ideally one single management system would allow managing all software in the entire experiment on every single computer. However, there is a lot of existing infrastructure and there are legacy tools which in practice cannot be replaced, so a software management tool must be open and compatible with existing solutions.

The first and the last point already highlight the major difficulty: in practice there can be no common tool for all. Accordingly, a set of tools, the underlying rationale and the policies will be described in the following.

LHCB'S SOFTWARE BASIS

In this section the software used in the LHCb Online system is categorized, and existing "native" tools, where they exist, are described. The OS software is here understood to comprise not only the core OS (which is quite small) but also all software which makes a general-purpose OS useful, such as an ssh-client, a web-browser or a word-processor, but also the shell, to name but a few.

Operating Systems

As operating system Scientific Linux CERN [3] is used in versions 4 and 5; it derives from the commercially very successful RedHat distribution, and both 32-bit and 64-bit versions are in use. Consequently the RedHat Package Management (RPM) [4] is used as the system to manage packages. RPM provides efficient version management of packages because it is built around a database. Further automation is achieved with the YUM [5] (Yellowdog Updater, Modified) tool, which is freely available and widely used on Fedora and RedHat desktop systems. YUM has powerful dependency management, which makes installing a software package very easy: YUM will automatically install all required packages in the required version. By themselves, however, RPM and YUM provide little or no support for configuration management, and they are biased towards updating, i.e. running the latest version available. Installing several versions of the same package requires some special effort.
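As a concrete illustration of the kind of query this database makes cheap, the following minimal sketch (not one of the LHCb production tools; the default package name is only an example of a package that may legitimately be installed several times) lists the versions of a package recorded in the local RPM database:

    #!/usr/bin/env python
    # Minimal sketch (not one of the LHCb tools): list the versions of a
    # package recorded in the local RPM database.  Requires the rpm
    # command, i.e. a RedHat-style system.
    import subprocess
    import sys

    def installed_versions(package):
        """Return the installed version-release strings for a package."""
        cmd = ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}\n", package]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return []   # rpm reports "not installed" with a non-zero exit
        return result.stdout.split()

    if __name__ == "__main__":
        pkg = sys.argv[1] if len(sys.argv) > 1 else "kernel"  # example name
        versions = installed_versions(pkg)
        if versions:
            print("%s installed as: %s" % (pkg, ", ".join(versions)))
        else:
            print("%s is not installed" % pkg)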
The second family of operating systems comprises Microsoft Windows XP and Windows 2003. Currently these are only used in their 32-bit versions. Microsoft has its own proprietary system for OS updates (Windows Update), which can be mirrored locally. Software packages are (ideally) distributed as MSIs, but this is not such a widely established standard as RPM in the RedHat world, and indeed we have Windows applications which come as executables, various install-kits or even as bare ZIP-archives. Version management is also rather weak, as it consists essentially of free-form version numbers in the Windows registry, whose usage is not enforced by the OS. Configuration management is not done by either of these tools; in modern Windows installations ("domains") it is achieved by the use of group policies. For package management we use Microsoft Systems Management Server (SMS) [6].

Application Software

The "physics" software in LHCb, that is all the code written to analyse and process physics data, is version-managed by a custom system called CMT [7] and distributed in the form of tar-archives. Configuration of the physics applications is done either by text files (so-called option files) or via steering scripts written in Python. One of CMT's strong points is that it is very easy to have multiple versions of the same package installed and to choose (in principle) freely between them at run-time. Configuration is limited to versions of sub- and dependent packages.
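To illustrate the idea of choosing between co-installed versions at run-time, the following is a minimal, hypothetical sketch; it is not CMT's actual interface, and the directory layout and package name are invented for the example:

    #!/usr/bin/env python
    # Hypothetical sketch of the idea only -- this is NOT CMT's actual
    # mechanism.  Assumes a made-up layout <base>/<package>/<version>/bin
    # with several versions installed side by side.
    import os

    def environment_for(base, package, version):
        """Build an environment whose PATH prefers the chosen version."""
        env = dict(os.environ)
        bindir = os.path.join(base, package, version, "bin")
        env["PATH"] = bindir + os.pathsep + env.get("PATH", "")
        return env

    if __name__ == "__main__":
        # Switch between two co-installed (hypothetical) versions.
        for version in ("v1r0", "v2r3"):
            env = environment_for("/sw", "examplepkg", version)
            print(version, "->", env["PATH"].split(os.pathsep)[0])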
The Experiment Control System (ECS) is based on the JCOP framework [8]. JCOP has developed its own component management system, which includes versioning support and (optionally) a centralised database with automated installation of components. These components are essentially scripts and data-structures for use with the PVSS SCADA software used throughout the ECS. JCOP provides a configuration and package management system built on an Oracle database, which is currently not used by LHCb.

SOFTWARE MANAGEMENT IN LHCB

LHCb has an isolated local area network. Isolated means that (almost) no packets are directly routed between the CERN networks, let alone the Internet, and the LHCb controls network [9]. Access from outside is only possible via dedicated gateway machines, which are dual-homed. This architecture allows LHCb to update only the very few exposed gateway machines whenever important security patches come out. These machines do not run any operationally important tasks, i.e. the experiment can run without them, albeit in this case it can be monitored only by on-site operators.

For machines inside the network, OS updates are infrequent and reserved for special maintenance windows. For Linux these updates come in the form of RPM package-lists, for Windows typically as service packs. Both RHEL4/5 and Windows 2003 are very mature operating systems; excepting security, there are not so many updates or feature additions.

Linux Software Management

Linux software which is not part of the physics software, and hence not managed using CMT, is only accepted in the form of RPMs. As it is quite easy to create RPMs, any software which we are asked to install is first packaged into an RPM, usually by an administrator. Good RPMs use pristine sources; that means that the packager takes the original software as it is and only applies a patch to change whatever is necessary to package the software properly. Unless there are strict requirements against this, all software packaged by the LHCb Online administrators is installed in /usr/local. As has been mentioned, RPMs and YUM repositories are not very suitable for managing multiple versions. However, even rigorous testing cannot cover all potential problems with a new version of a package, and more often than not it becomes necessary to revert to a previous version on all or, more difficult, on part of the cluster.

A tool suite called Quattor [10] has been developed for the LHC grid initiative. An adaptation of Quattor for control systems, developed at CERN, is LinuxFC (Linux for Controls). LinuxFC/Quattor allows keeping repositories containing all versions of a package and defining in a database which machine receives which version of a specific package. The descriptions of (groups of) machines are called templates. Installation of the RPMs is handled by service processes running on each machine. Quattor components allow configuring almost every "standard" package on a Linux system, including for example privileges for users.

Quattor makes it easy to group machines and also automates the installation. Quattor has two disadvantages: firstly, it does not provide a tool for dependency checking, so that missing dependencies are only discovered at package installation; secondly, it has the distinct flavour of a Unix system administration tool, which results in a steep learning curve and consequently a tendency of people to try to bypass it. We are actively working on higher-level tools to simplify its use and increase the user-friendliness.

For example, we have developed a Nagios plugin [2] to verify that the actual installation of a package has worked on a node. An alarm and an email are generated when the installation fails on a specific node. This is useful as conflicts can depend on the specific software environment of a node, and by no means all nodes are equal, since in general only required software is installed.
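The following sketch shows the general shape such a check can take; it is not the actual LHCb plugin, the expected package list is a hypothetical per-node profile, and the exit codes simply follow the usual Nagios convention (0 = OK, 2 = CRITICAL):

    #!/usr/bin/env python
    # Sketch of a Nagios-style check in the spirit of the plugin described
    # above -- not the actual LHCb plugin.  EXPECTED is a hypothetical
    # per-node profile of packages and versions.
    import subprocess
    import sys

    EXPECTED = {"example-daq-tools": "1.2-3"}   # hypothetical profile

    def installed(package):
        """Return the installed version-release of a package, or None."""
        cmd = ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", package]
        result = subprocess.run(cmd, capture_output=True, text=True)
        return result.stdout.strip() if result.returncode == 0 else None

    def main():
        problems = []
        for package, wanted in EXPECTED.items():
            have = installed(package)
            if have != wanted:
                problems.append("%s: have %s, want %s" % (package, have, wanted))
        if problems:
            print("CRITICAL - " + "; ".join(problems))
            return 2            # Nagios: CRITICAL
        print("OK - all monitored packages at the expected version")
        return 0                # Nagios: OK

    if __name__ == "__main__":
        sys.exit(main())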
Windows Software Management

There are only about 100 Windows machines in the LHCb Online system, much fewer than Linux machines (well over 1000); nevertheless our experience is that it is crucial to have automated management for them as well. Conceptually what we do is very similar to the Linux part, replacing RPMs with MSIs and Quattor with a combination of group policies and SMS. Again, software which we receive is re-packaged in MSIs for easy deployment. While the first impression of SMS and the Microsoft group policy editor is very good and they seem easier to use than Quattor, in practice it takes a lot of experience and experimentation to decide which configuration task to achieve with which tool. Microsoft has released an updated version of SMS, now called System Center Configuration Manager 2007, which we are currently putting into production and which promises to greatly facilitate management, in particular when the migration to Windows 2008 server and Windows 7 will have taken place.

CMT Managed Software

The physics application software used in the High Level Trigger is distributed as CMT packages in Unix tar-balls. Since these tar-balls contain no meta-information, all version management is done by external tools using version numbers in the file-names of the tar-balls. While conceptually simple, this approach has disadvantages when something goes wrong. SMS and RPM are both based on a database approach and allow for easy roll-back, which is inherent in the tools, i.e. roll-back requires little or no support from the packager. On the other hand, since CMT is also used to manage the physics software on remote sites connected to the LHC computing grid, we can leverage the distribution facilities developed for this purpose, and updates run automatically, pulling packages in from a central web-repository.

We install this software only once, in a shared filesystem [11]. Replication to more places within the LHCb network, which might be desirable to improve scalability, will require additional work, because these nodes have no access to the central repository.

This software does not require much special configuration other than selecting the desired version and configuring load-paths accordingly. The configuration of internal parameters is part of the operation of the trigger itself and is not managed by the system administrators.
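As an illustration of this purely name-based version handling, the sketch below (not the actual distribution tools; the naming scheme <package>_vXrY.tar.gz and the file listing are hypothetical) picks the newest release from a list of tar-ball file names:

    #!/usr/bin/env python
    # Sketch of version handling based purely on file names, as described
    # above -- not the actual distribution tools.  The naming scheme and
    # the repository listing are hypothetical.
    import re

    def parse_version(filename):
        """Extract (major, minor) from names like 'examplepkg_v35r1.tar.gz'."""
        match = re.search(r"_v(\d+)r(\d+)\.tar\.gz$", filename)
        return (int(match.group(1)), int(match.group(2))) if match else None

    def latest(filenames):
        """Return the file name carrying the highest version number."""
        versioned = [(parse_version(name), name)
                     for name in filenames if parse_version(name) is not None]
        return max(versioned)[1] if versioned else None

    if __name__ == "__main__":
        tarballs = ["examplepkg_v34r2.tar.gz",
                    "examplepkg_v35r0.tar.gz",
                    "examplepkg_v35r1.tar.gz"]
        print("latest release:", latest(tarballs))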
Operational Aspects

Only administrators are allowed to perform updates. While this ensures a minimum level of checking, it puts a burden on a rather small team. Quattor and SMS both have functionality for fine-grained privilege separation, which would allow managing sub-sets of software and nodes; however, both these tools are complex and difficult to learn for physicist users. We are currently working on high-level tools to facilitate certain tasks in a robust way, which will then allow delegating them.

Updates must be tested on a test-platform first. We have test installations of all major systems, including an entire "dummy detector", for this purpose. Afterwards updates are pushed, usually during a maintenance window. Reboots are rarely necessary; we try to maximise the uptime of the machines, and normally only the applications are restarted. Many changes on Windows require that the user logs in to a new session, as many group policies are only applied at login.

Software installations, even major ones, are fairly quick. Typical times for a change on all Linux machines (more than 1000) are of the order of 5 minutes, even though there is only a single software server currently in operation. In the future this can be upgraded if scalability demands it. Software installations on the shared filesystem are, naturally, even quicker; however, there are scalability issues at run-time.

CONCLUSION

Strict version and configuration management is mandatory to successfully run a large infrastructure, in particular with a very small team of system administrators. LHCb Online has four broad categories of software, each with their own installation tool. These come from long traditions and are well maintained outside LHCb, so we have abstained from trying to integrate them into one common tool, which would be an enormous effort in repackaging. LinuxFC/Quattor and Microsoft SMS are at the heart of software and configuration management for all the base services and all the non-LHCb-specific software. While powerful, these tools remain complex. To ease the burden, we try to make the tools more user-friendly for non-system experts by wrapping them in high-level scripts and utilities which hide the complexity and automate most of the routine tasks.

REFERENCES

[1] A. Augusto Alves et al. The LHCb Detector at the LHC. JINST, 3:S08005, 2008.
[2] Enrico Bonaccorsi and Niko Neufeld. Monitoring the LHCb experiment computing infrastructure with NAGIOS. In Proceedings of ICALEPCS 2009, 2009.
[3] Scientific Linux CERN home-page.
[4] RedHat home-page.
[5] Yellowdog Updater Modified home-page.
[6] Microsoft Systems Management Server home-page.
[7] Configuration Management Tool home-page.
[8] The Joint Controls Project home-page.
[9] Guoming Liu and Niko Neufeld. Management of the LHCb network based on SCADA system. In Proceedings of ICALEPCS 2009, 2009.
[10] Quattor home-page.
[11] Rainer Schwemmer and Niko Neufeld. In Proceedings of ICALEPCS 2009, 2009.