Beyond Monitoring: Proactive Server Preservation in an HPC Environment
Recommended publications
-
Munin Documentation Release 2.0.44
By Stig Sandbeck Mathisen <[email protected]>, Dec 20, 2018. Contents: 1. Munin installation (prerequisites, installing Munin, initial configuration, getting help, upgrading Munin from 1.x to 2.x); 2. The Munin master (role, components, configuration, other documentation); 3. The Munin node (role, configuration, other documentation); 4. The Munin plugin (role, other documentation); 5. Documenting Munin (nomenclature); 6. Reference (man pages, other reference material); 7. Examples (Apache virtualhost configuration, lighttpd configuration, nginx configuration, graph aggregation)
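The plugin chapter listed above covers Munin's plugin protocol: a plugin is any executable that prints graph metadata when called with the argument "config" and its current values otherwise. A minimal sketch in Python; the plugin name and the field name "load" are illustrative, not taken from the manual:

    #!/usr/bin/env python
    # Minimal Munin plugin sketch: answer "config" with graph metadata,
    # otherwise print the current value in "field.value N" form.
    import os
    import sys

    def main():
        if len(sys.argv) > 1 and sys.argv[1] == "config":
            print("graph_title Load average (1 min)")
            print("graph_vlabel load")
            print("load.label load")
            return
        # os.getloadavg() is used here only as an easy value to report.
        print("load.value %.2f" % os.getloadavg()[0])

    if __name__ == "__main__":
        main()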
-
Univa Grid Engine 8.0: Superior Workload Management for an Optimized Data Center
Why Univa Grid Engine? Univa Grid Engine 8.0 is our initial release of Grid Engine and is quality controlled and fully supported by Univa. This release enhances the Grid Engine open source core platform with advanced features and integrations into enterprise-ready systems, as well as cloud management solutions that allow organizations to implement the most scalable and highest performing distributed computing solution on the market. Overview: Grid Engine is an industry-leading distributed resource management (DRM) system used by hundreds of companies worldwide to build large compute cluster infrastructures for processing massive volumes of workload. A highly scalable and reliable DRM system, Grid Engine enables companies to produce higher-quality products, reduce time to market, and streamline and simplify the computing environment. Univa Grid Engine 8.0 marks the start of a product evolution delivering critical functionality such as improved 3rd-party application integration and advanced capabilities for license and policy management. Protect your investment: Univa Grid Engine 8.0 makes it easy to create a cluster of thousands of machines by harnessing the combined computing power of desktops, servers and clouds in a simple, easy-to-administer environment. See why customers are choosing Univa Grid Engine.
-
Distributed Resource Management Application API (DRMAA) Version 2
Dr. Peter Tröger, Hasso-Plattner-Institute, University of Potsdam, [email protected], DRMAA-WG Co-Chair, http://www.drmaa.org/. About the speaker: senior researcher at Hasso-Plattner-Institute, Potsdam; research field: dependable systems, including new online failure prediction and recovery techniques (SAP ByDesign, TACC Ranger, IBM z196, Intel), fault injection on the firmware level (Fujitsu Technology Solutions), new reliability modeling approaches (DSN paper pending), and virtualization-based fault tolerance (VMware, Xen, KVM). Teaching: dependable systems, parallel programming concepts, operating systems, middleware and distributed systems; standardization in the Open Grid Forum as a side activity. Hasso-Plattner-Institute for Software Engineering (HPI): a privately funded and independent research institute, founded in 1999; associated with the University of Potsdam, Germany; B.Sc. and M.Sc. curriculum in IT-Systems Engineering; Ph.D. programme; rich experience in research projects typically conducted with industrial partners on a national and international level; research school for PhDs with international departments (Cape Town, Haifa, China). [Slide diagram: Open Grid Forum (OGF) standards landscape, showing end-user applications and portals reaching meta-schedulers through the SAGA, OGSA/OCCI and DRMAA APIs for portability.]
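DRMAA specifies a portable job submission and control API implemented by several DRM systems (Grid Engine, Condor, LSF and others). As an illustration of the kind of interface the standard defines, here is a minimal sketch using the widely available Python binding for DRMAA version 1 (the "drmaa" package); the executable path and argument are placeholders, and a configured DRM system with its DRMAA library is assumed:

    # Minimal DRMAA v1 sketch with the Python "drmaa" binding.
    import drmaa

    with drmaa.Session() as session:
        template = session.createJobTemplate()
        template.remoteCommand = "/bin/sleep"   # placeholder executable
        template.args = ["10"]
        job_id = session.runJob(template)
        print("submitted job", job_id)
        # Block until the job finishes and report its exit status.
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print("job", info.jobId, "finished with status", info.exitStatus)
        session.deleteJobTemplate(template)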
-
Homework 5: DNS, HTTPD, SNMP (requirements include the rrdtool Perl module, Net::SNMPTrapd from CPAN installed as root, NetSNMP::agent bundled with Net-SNMP, and an SNMP agent)
Requirements: register one dedicated domain name for yourself and set up a DNS server with reasonable SOA, NS and MX records. NS delegation (with team mates): dedicate a sub-domain to each of your team mates, build a slave server for one team mate and a stub server for another, and keep updates synchronized. Provide reverse resolution for your NAT range, 192.168.x.0/24, for each team mate (example layout: a.nctucs.net at 140.113.a.a serving 192.168.0.1/24 as slave, b.nctucs.net at 140.113.b.b serving 192.168.0.2/24, c.nctucs.net at 140.113.c.c serving 192.168.0.3/24 as stub). Views: create an A record for view.example.csie.net so that queries from 192.168.0.0/24 receive 192.168.0.1 and all other queries receive your normal IP. Logging: record all queries to /var/log/named.log, rotate the log, and be prepared to explain what each entry in named.log means. Add reasonable SPF/DomainKeys records for your server and configure your mail system to support them; also add a reasonable SSHFP record for your server. Dynamic DNS update: your DNS server should accept update requests from 140.113.17.225 and from your team mates, and you should know how to update a DNS record. Management: your DNS server should support TSIG and allow connections from 140.113.17.225, only allow AXFR requests from 140.113.17.225, and only allow recursive queries from your team mates and 140.113.17.225. Appendix: LDAP as a backend database (dns/bind9-sdb-ldap, http://www.openldap.org/, http://bind9-ldap.bayour.com/), SPF setup wizard (http://old.openspf.org/wizard.html), DKIMproxy (http://dkimproxy.sourceforge.net). HTTPD requirements: apache, lighttpd, nginx, etc.
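The dynamic-update and TSIG requirements above amount to sending a signed RFC 2136 update to the zone's master. A small sketch with the dnspython library; the key name, key material and server address are placeholders (use the key configured in your named.conf and your master's address, e.g. 140.113.a.a in the assignment's example), only the zone and record name come from the assignment:

    # Sketch of a TSIG-signed dynamic DNS update (RFC 2136) with dnspython.
    import dns.query
    import dns.rcode
    import dns.tsigkeyring
    import dns.update

    # Placeholder TSIG key; in practice this is the key generated for the
    # master (e.g. with tsig-keygen) and referenced in named.conf.
    keyring = dns.tsigkeyring.from_text({
        "hw5-key.": "c2VjcmV0LXNlY3JldC1zZWNyZXQ=",
    })

    update = dns.update.Update("example.csie.net", keyring=keyring,
                               keyname="hw5-key.")
    # Replace (or create) the A record for view.example.csie.net.
    update.replace("view", 300, "A", "192.168.0.1")

    # 192.0.2.1 is a placeholder for your master's address.
    response = dns.query.tcp(update, "192.0.2.1", timeout=10)
    print("update rcode:", dns.rcode.to_text(response.rcode()))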
-
Sun Grid Engine Update (Daniel Gruber, Software Engineer, Sun Microsystems Deutschland GmbH; Sun is a wholly-owned subsidiary of Oracle)
Content: what's new in SGE, DRMAA, customer feedback. Sun Grid Engine releases: 6.2 major (announced 23.09.2008): SDM, scalability (> 60,000 cores), AR, IJS; 6.2 update 1 (18.12.2008): maintenance release; 6.2 update 2 (31.03.2009): GUI installer, JSV, per-job resources, jemalloc; 6.2 update 3 (23.06.2009): SGE Inspect, SDM cloud adapter, exclusive host; 6.2 update 4 (23.10.2009): maintenance release; 6.2 update 5 (22.12.2009): slot-wise preemption, core binding, enhanced Inspect, Java JSV, array job throttling, Hadoop support. SDM (Service Domain Manager) coordinates several Grid Engine clusters, a power-saving spare pool (via IPMI) and a cloud service drawn from a shared spare pool. JSV (Job Submission Verifier): administrators (or users) can reformulate (insert, delete) job submission parameters based on JSV scripts, and jobs can be rejected based on their parameters; bash, csh, tcl, perl and Java JSV scripts are supported. GUI installer: installs a complete SGE cluster. Slot-wise preemption: a slot limit per host; jobs from subordinate queues are suspended in order to get high-priority jobs to run, suspending the longest- or shortest-running jobs; multiple layers (suspend trees) are possible, with a definable order per layer. Core binding: a job submission extension
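Core binding, the last (truncated) item above, is requested at submission time. A rough sketch of driving such a submission from Python; the "-binding linear:1" flag syntax is my recollection of the 6.2u5 submission option and should be checked against "man qsub", and the job script path is a placeholder:

    # Hedged sketch: submit a job with core binding via qsub (SGE 6.2u5+).
    # The "-binding linear:1" syntax is an assumption; verify with "man qsub".
    import subprocess

    result = subprocess.run(
        ["qsub", "-binding", "linear:1", "-N", "bound_job", "./job_script.sh"],
        capture_output=True, text=True, check=True)
    # qsub reports the assigned job id on stdout.
    print(result.stdout.strip())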
-
IBM Spectrum LSF Version 10 Release 1 Release Notes
Note: before using this information and the product it supports, read the information in "Notices" on page 41. This edition applies to version 10, release 1 of IBM Spectrum LSF (product numbers 5725G82 and 5725L25) and to all subsequent releases and modifications until otherwise indicated in new editions. Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change. If you find an error in any IBM Spectrum Computing documentation, or you have a suggestion for improving it, let us know: log in to IBM Knowledge Center with your IBMid and add your comments and feedback to any topic. © Copyright IBM Corporation 1992, 2017. US Government Users Restricted Rights – use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents: Release Notes for IBM Spectrum LSF Version 10.1; What's new in IBM Spectrum LSF Version 10.1 Fix Pack 3 (job scheduling and execution, resource management, container support, command output formatting, logging and troubleshooting, other changes to IBM Spectrum LSF); What's new in IBM Spectrum LSF Version 10.1 (performance enhancements, pending job management, job scheduling and execution, host-related features, other changes to LSF behavior); Learn more about IBM Spectrum LSF (product notifications, IBM Spectrum LSF documentation); Product compatibility (server host compatibility, LSF add-on compatibility)
-
The Cacti Manual
By Ian Berry, Tony Roman, Larry Adams, J.P. Pasnak, Jimmy Conner, Reinhard Scheck, and Andreas Braun. Published 2017, Copyright © 2017 The Cacti Group. This project is licensed under the terms of the GPL: this program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. All product names are property of their respective owners; such names are used for identification purposes only and are not indicative of endorsement by or of any company, organization, product, or platform. Table of contents: I. Installation (1. Requirements, 2. Installing Under Unix
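Besides SNMP, Cacti can gather values through external data input scripts. As a sketch of that mechanism (the space-separated "fieldname:value" output convention is how I recall the Cacti data input method documentation describing multi-value scripts; the field names here are made up):

    #!/usr/bin/env python
    # Sketch of a Cacti data input script: print space-separated
    # "fieldname:value" pairs on one line for Cacti to parse.
    import os

    # Example measurements; any locally gathered numbers would do.
    load1, load5, load15 = os.getloadavg()
    print("load1:%.2f load5:%.2f load15:%.2f" % (load1, load5, load15))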
-
Extend/alter Condor via developer APIs/plugins
CERN, Feb 14 2011. Todd Tannenbaum, Condor Project, Computer Sciences Department, University of Wisconsin-Madison (www.cs.wisc.edu/Condor). Some classifications: Application Program Interfaces (APIs) for job control and for operational monitoring, plus extensions. Job control APIs, the biggies: command line tools, DRMAA, Condor DBQ, and the web service interface (SOAP), http://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=SoapWisdom.

Command line tools: don't underestimate them! Your program can create a submit file on disk and simply invoke condor_submit:

    system("echo universe=VANILLA > /tmp/condor.sub");
    system("echo executable=myprog >> /tmp/condor.sub");
    /* ... further submit directives ... */
    system("echo queue >> /tmp/condor.sub");
    system("condor_submit /tmp/condor.sub");

Your program can also build the submit description in memory and give it to condor_submit through stdin. Perl:

    open(SUBMIT, "|condor_submit");
    print SUBMIT "universe=VANILLA\n";
    # ...

C/C++:

    FILE *submit = popen("condor_submit", "w");
    fprintf(submit, "universe=VANILLA\n");
    /* ... */
    pclose(submit);

Using a +Attribute with condor_submit:

    universe   = VANILLA
    executable = /bin/hostname
    output     = job.out
    log        = job.log
    +webuser   = "zmiller"
    queue

Use -constraint and -format with condor_q:

    % condor_q -constraint 'webuser=="zmiller"'
    -- Submitter: bio.cs.wisc.edu : <128.105.147.96:37866> : bio.cs.wisc.edu
     ID       OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
    213503.0  zmiller  10/11 06:00  0+00:00:00
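The same stdin technique translates directly to other languages. A sketch in Python; the executable name "myprog" is a placeholder, and whether your condor_submit reads the description from stdin without an explicit "-" argument should be confirmed locally (the slide's Perl and C examples pipe with no extra argument):

    # Sketch: hand a submit description to condor_submit over stdin,
    # mirroring the Perl/C examples above. "myprog" is a placeholder.
    import subprocess

    submit_description = """\
    universe   = VANILLA
    executable = myprog
    output     = job.out
    log        = job.log
    +webuser   = "zmiller"
    queue
    """

    proc = subprocess.run(["condor_submit"], input=submit_description,
                          capture_output=True, text=True)
    print(proc.stdout)
    if proc.returncode != 0:
        print("condor_submit failed:", proc.stderr)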
-
Storage of Network Monitoring and Measurement Data
A report submitted in partial fulfillment of the requirements for the degree of Bachelor of Computing and Mathematical Sciences at The University of Waikato, by Nathan Overall, © 2012 Nathan Overall. Abstract: despite the limitations of current network monitoring tools, there has been little investigation into providing a viable alternative. Network operators need high resolution data over long time periods to make informed decisions about their networks. Current solutions discard data or do not provide the data in a practical format. This report addresses this problem and explores the development of a new solution to address these problems. Acknowledgements: the members of the WAND network group for their continued support during the project, including my supervisor Richard Nelson, with a special mention to Shane Alcock and Brendon Jones for their ongoing assistance while they developed the WAND Network Event Monitor; Dr. Scott Raynel for his support and advice throughout the project; and the WAND network group and Lightwire Ltd for providing the resources necessary to conduct the project. Contents: list of acronyms; list of figures; 1 Introduction (network operation, overview of the problem, goals, plan of action); 2 Background (introduction, Round Robin Database, tools using Round Robin Database (RRD): Smokeping, Cacti; the Active Measurement Project; OpenTSDB
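The Round Robin Database discussed in the background chapter stores fixed-size time series with automatic consolidation, which is also why older data loses resolution. A small sketch of the idea using the rrdtool Python bindings; the file name, data source name, step and consolidation settings are illustrative, and the exact binding API may differ between python-rrdtool packages:

    # Sketch: create an RRD for an interface counter and push one sample.
    # Assumes the "rrdtool" Python bindings; names and intervals are made up.
    import time
    import rrdtool

    rrdtool.create(
        "iface_octets.rrd",
        "--step", "60",                   # expect one sample per minute
        "DS:octets:COUNTER:120:0:U",      # data source: an octet counter
        "RRA:AVERAGE:0.5:1:1440",         # 1-minute averages for one day
        "RRA:AVERAGE:0.5:60:720",         # 1-hour averages for 30 days
    )

    # Update with a timestamped value (here: a fake counter reading).
    rrdtool.update("iface_octets.rrd", "%d:%d" % (int(time.time()), 123456789))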
-
Performance Analysis in Large Environments with collectd (original title: Performance-Analyse in großen Umgebungen mit collectd)
Talk given in German at FrOSCon 2009 (2009-08-22) by Sebastian "tokkee" Harl <[email protected]>. Outline: what is collectd? Important properties, important plugins, writing your own extensions, and a look beyond collectd itself. What is collectd? collectd gathers performance data from machines; performance data means, for example, CPU utilization, memory usage and network traffic. The data is collected, processed and stored, and is frequently presented as graphs. Not to be confused with monitoring! Contact: homepage http://collectd.org/, mailing list [email protected], IRC #collectd on irc.freenode.net, "Web 2.0": http://identi.ca/collectd. Important properties: a daemon; free software (mostly GPLv2); portable (Linux, *BSD, Solaris, ...); scalable (from OpenWrt up to clusters and clouds); efficient (default resolution: 10 seconds); modular (over 70 plugins).
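The modularity mentioned above includes bindings for writing plugins in scripting languages. A rough sketch of a read plugin using collectd's Python plugin interface; the plugin and type_instance names are made up, and the exact Values() fields should be checked against the collectd-python documentation:

    # Sketch of a collectd read plugin in Python (loaded through collectd's
    # "python" plugin). Names like "example_plugin" are placeholders.
    import collectd

    def read_callback():
        # Report a single gauge value every collectd interval (default 10 s).
        vals = collectd.Values(plugin="example_plugin",
                               type="gauge",
                               type_instance="answer")
        vals.dispatch(values=[42.0])

    collectd.register_read(read_callback)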
-
Sun Grid Engine Update (SGE Workshop 2007)
SGE Workshop 2007, Regensburg, September 10-12, 2007. Andy Schwierskott, Sun Microsystems. Copyright Sun Microsystems. What is grid computing? The network is the computer™: distributed resources, a management infrastructure, and a targeted service or workload; utilization and performance go up while costs and complexity go down. Examples: aggregating desktops for computation, aka cycle stealing (e.g. SETI@Home, using engineers' desktops at night); managing an entire rack from a single interface; rendering and simulation "farms". What Sun Grid Engine does in grid computing: it helps solve problems horizontally, in high performance (technical) computing and in data center optimization; examples include EDA, modeling, transaction validation and MCAD. It increases utilization and reduces turnaround times; 10%-25% is typical, and it can go up to 90% and beyond with cycle stealing. In short, it intelligently automates batch and interactive job distribution for jobs running from seconds to days and weeks. Target industries & typical workloads. [Slide: Sun Grid Engine feature overview, covering allocation and resource prioritization policies, extensible workload-to-resource matching and selection, customizable system load and access regulation, resource control, definable job execution contexts, web-based reporting and resource analysis, accounting, and an open and integratable data source.] [Slide: administration features, covering hierarchical configuration, ease of integration with N1 systems management products, 3rd-party software integration, standards-compliant full CLI functionality, and wide commercial OS support for heterogeneous environments.] Sun Grid Engine components: qsub, qrsh, qlogin, qmon, qtcsh, shadow master. Sun Grid Engine 6: SGE 6.0 released in 2004; sites slowly adopt new functionality
-
Design of an Accounting and Metric-based Cloud-shifting and Cloud-seeding Framework for Federated Clouds and Bare-metal Environments
Gregor von Laszewski*, Hyungro Lee, Javier Diaz, Fugang Wang, Koji Tanaka, Shubhada Karavinkoppa and Geoffrey C. Fox (Indiana University, 2719 East 10th Street, Bloomington, IN 47408, U.S.A., [email protected]), and Tom Furlani (Center for Computational Research, University at Buffalo, 701 Ellicott St, Buffalo, New York 14203). Abstract: we present the design of a dynamic provisioning system that is able to manage the resources of a federated cloud environment by focusing on their utilization. With our framework, it is not only possible to allocate resources at a particular time to a specific Infrastructure as a Service framework, but also to utilize them as part of a typical HPC environment controlled by batch queuing systems. Through this interplay between virtualized and non-virtualized resources, we provide a flexible resource management framework that can be adapted based on users' demands. The need for such a framework is motivated by real user data gathered during our operation of FutureGrid (FG). [First-page fragment:] ... and its predecessor meta-computing elevated this goal by not only introducing the utilization of multiple queues accessible to the users, but by establishing virtual organizations that share resources among the organizational users. This includes storage and compute resources and exposes the functionality that users need as services. Recently, it has been identified that these models are too restrictive, as many researchers and groups tend to develop and deploy their own software stacks on computational resources to build the specific environment required for their experiments. Cloud computing provides here a good solution as it introduces a level of abstraction that lets the