Parallelism of Machine Learning Algorithms
10/9/2013

Presented by: Walid Budgaga
CS 655 – Advanced Topics in Distributed Systems
Computer Science Department
Colorado State University

Outline
• Condor
• The Anatomy of the Grid
• Globus Toolkit

Motivation
• High Throughput Computing (HTC)?
  • Large amounts of computing capacity over long periods of time
  • Measured in operations per month or per year
• High Performance Computing (HPC)?
  • Large amounts of computing capacity for short periods of time
  • Measured in FLOPS

Motivation
• HTC is suitable for scientific research
• Example (parameter sweep):
  • Testing parameter combinations to keep the temperature at a particular level
  • op(x,y,z) takes 10 hours, 500 MB of memory, and 100 MB of I/O
  • With 100 values of x, 50 of y, and 25 of z: 100 x 50 x 25 = 125,000 runs, or 1,250,000 hours (roughly 143 years on a single machine)

Motivation
• Fort Collins Science Center uses Condor for scientific projects
• Source: http://www.fort.usgs.gov/Condor/ComputingTimes.asp

HTC Environment
• Large amounts of processing capacity?
  • Exploiting computers on the network
  • Utilizing heterogeneous resources
  • Overcoming platform differences
    • By building a portable solution
  • Including a resource management framework
• Over long periods of time?
  • The system must be reliable and maintainable
    • Surviving failures (software and hardware)
    • Allowing resources to leave and join at any time
    • Upgrading and configuring without significant downtime

HTC Environment
Also, the system must meet the needs of:
• Resource owners
  • Rights respected
  • Policies enforced
• Customers
  • The benefit of additional processing capacity outweighs the complexity of usage
• System administrators
  • The real benefit provided to users outweighs the maintenance cost

HTC
Other considerations:
• Distributively owned resources lead to:
  • Decentralized maintenance and configuration of resources
  • Variable resource availability
    • Applications can be preempted at any time
  • An additional degree of resource heterogeneity

Condor Overview
• Open-source high-throughput computing framework for compute-intensive tasks
• Manages distributively owned resources to provide a large amount of capacity
• Developed at the Computer Sciences Department at the University of Wisconsin-Madison
• Name changed to HTCondor in October 2012

Condor Overview
Customer agent
• Represents the customer's job (application)
• Can state its requirements as follows:
  • Need a Linux/x86 platform
  • Want the machine with the highest memory capacity
  • Prefer a machine in lab 120

Condor Overview
Resource agent
• Represents the resource
• Can state its offers as follows:
  • Platform: a Linux/x86 platform
  • Memory: 1 GB
• Can state its requirements as follows:
  • Run jobs only when the keyboard and mouse have been idle for 15 minutes
  • Run jobs only from the computer department
  • Never run jobs belonging to [email protected]

Condor Overview
Matchmaker
• Matches jobs and resources based on requirements and offers
• Notifies the agents when a match is found

Challenges of HTC system:
• Software Development
• System Administration

Software Development: Four primary challenges
• Utilization of heterogeneous resources
  • Requires system portability
• Network protocol flexibility
  • Required to cope with constantly changing resource and customer needs
  • Required for adding new features
• Remote file access
  • Required so that data can be accessed from any workstation
• Utilization of non-dedicated resources
  • Requires the ability to preempt and resume applications

Software Development: Utilization of heterogeneous resources
Requires system portability, obtained through a layered system design
• Network API:
  • Connection-oriented and connectionless
  • Reliable and unreliable interfaces
  • Authentication and encryption
• Process management API:
  • Create, suspend, resume, and kill a process
• Workstation statistics API:
  • Reports the information needed to:
    • Implement resource owner policies
    • Verify the validity of the application's requirements

Software Development: Network Protocol Flexibility
• To cope with adding new services to HTC without frequently updating HTC components, a general-purpose data format may be used
• For example, Condor uses a protocol similar to RPC

Software Development: Remote file access (1)
To guarantee that HTC applications can access their data from any workstation in the cluster
• Three possible solutions:
• Using an existing distributed file system (NFS)
  • The HTC authenticates the customer's application,
  • privileges need to be assigned, or
  • file access permission is granted

Software Development: Remote file access (2)
• Redirecting file I/O system calls
  • Interposing the HTC between the application and the operating system
  • By linking the application with an interposition library
  • Does not require file storage on the remote workstation
  • Reduces performance
  • A portable interposition library is difficult to develop and maintain

Software Development: Remote file access (3)
• Implementing data file staging
  • Transferring the input and output files specified by the customer to the remote workstation
  • Requires free disk space on the workstation
  • High cost for large data files

Software Development: Utilization of non-dedicated resources
• Requires the ability to preempt and resume applications
• This can be obtained using checkpoints
• Checkpoint:
  • A snapshot of the state of the executing program
  • It can be used to restart the program at a later time
  • Provides reliability
  • Enables preemptive-resume scheduling

Software Development: Checkpoints in Condor (1)
• Used as a migration mechanism
  • The job scheduler migrates jobs from one workstation to another
• Used to resume vacated jobs
• The program has the ability to checkpoint itself
  • Using a checkpointing library
• To provide additional reliability
  • HTCondor can be configured to write checkpoints periodically

Software Development: Checkpoints in Condor (2)
When checkpoints are written:
• Periodically, if HTCondor is configured to do so
• At any time by the program
• When a higher-priority job has to start on the same machine
• When the machine becomes busy

Software Development: Checkpoints in Condor (3)
Storing of checkpoints
• By default, checkpoints are stored on the local disk of the machine where the job was submitted
• However, HTCondor can be configured to store them on a checkpoint server

System Administration
The administrator has to answer to:
• Resource owners
  • By guaranteeing that the HTC enforces their policies
• Customers
  • By ensuring that they receive valuable services from the HTC
• Policy makers
  • By demonstrating that the HTC is meeting its stated goals

System Administration: Access Policies
• Specify when and how the resources can be accessed, and by whom
• The policies might be specified using a set of expressions
• For example, in Condor:
  • Requirements (true: start accessing the resources)
  • Rank (preference)
  • Suspend
  • Continue
  • Vacate (notification to stop using the resources)
  • Kill (stop using the resources immediately)

System Administration: Access Policies
[Example from Condor; slide content not recovered]

System Administration: Reliability
• The HTC must be prepared for failures and must automate recovery from common failures
• It is not an easy job:
  • Detect the difference between normal and abnormal termination
  • Don't leave running applications unattended
  • Choose the correct checkpoint to restart from
  • Decide when it is safe to restart the application
  • Determine and avoid bad nodes

System Administration: System logs
• Logs are the primary tool for diagnosing system failures
• They make it possible to reconstruct the events leading up to a failure
• Problems and suggested solutions:
  • Log files can grow to unbounded size
    • Keep detailed logs for recent events and summaries for old information
  • Managing distributed log files
    • Store logs centrally on a file server or a customized log server
    • Provide a single interface by installing logging agents on each workstation

System Administration: Monitoring and Accounting
It helps the administrator to:
• Assess the current and historical state of the system
• Track system usage

System Administration
[Figure: CondorView Usage Graph]

System Administration: Security (1)
Possible attacks
• Resource attack
  • An unauthorized user gains access to a resource
  • An authorized user violates the resource owner's access policy
• Customer attack
  • The customer's account or files are put at risk via the HTC environment

System Administration: Security (2)
To protect against unauthorized access to a resource
• The resource owner may specify authorized users in the access policy
• Condor example:
  Requirement = (Customer == "[email protected]") || (Customer == "[email protected]")

System Administration: Security (3)
To protect against violations of the resource access policy, the resource agent may:
• Set resource consumption limits using the system API
• Run the application under a "guest" account
• Set the file system root directory to a "sandbox" directory
• Intercept the system calls performed by the application via the OS interposition interface

System Administration: Security (4)
To protect the customer's account and files
• The HTC must ensure that all resource agents are trustworthy
• Placing data files only on trusted hosts
• Using an authentication mechanism
• Encrypting network streams

System Administration: Remote Customers
• Remote access is more convenient than direct access
  • The customer creates an HTC account
• The customer agent can be installed on the customer's workstation
  • The administrator allows this agent to access the HTC cluster
  • For untrusted customers, extra security procedures may be required

Condor
Condor is suitable for high-throughput computations
• Running programs unattended and in the background
• Redirecting console input and output from and to files
• Running many