FlexiScale Next Generation Data Centre Management

Gihan Munasinghe, Xcalibre Communications, Livingston, UK ([email protected])
Paul Anderson, School of Informatics, University of Edinburgh, UK ([email protected])

Abstract— Data centres and server farms are rapidly becoming a key infrastructure component for businesses of all sizes. However, matching the available resources to the changing level of demand is a major challenge. FlexiScale is a data centre architecture which is designed to deliver a guaranteed QoS level for the exported services. It does this by autonomically reconfiguring the infrastructure to cater for fluctuations in the demand. FlexiScale is based on virtualisation technology which provides location- and hardware-transparent services. It currently uses Virtual Iron [2] as the management platform, and a XEN-based virtualisation platform [5]. In this paper, we describe our experiences and difficulties in implementing the FlexiScale architecture. Phase I is currently in production – this provides a scalable, fault-tolerant hardware architecture. Phase II is currently at the prototype stage – this is capable of automatically adding and removing virtual servers to maintain a guaranteed QoS level.

I. INTRODUCTION

Catering for fluctuations in demand is one of the most significant problems for many business IT departments. Internet services, in particular, can be subject to very rapid and extreme changes. For example, a very small company may easily see a hundred-fold increase in its web server load following a television advertisement.

Traditionally, services are allocated to dedicated servers, and the "solution" to this problem is simply to over-provision the hardware. But this is far from ideal – peak loads are still likely to swamp the allocated hardware, and the normal load will lead to idle machines which are expensive to own, and (increasingly) to power. Of course, increasing or decreasing the resources allocated to any service involves reassigning the dedicated hardware, and perhaps even physical reallocation. Bill LeFebvre describes a good example of this [7] – on 9/11 the load on the CNN news service was such that they needed to increase their number of servers from 10 (at 08.45) to 52 (by 13.00)!

Many businesses outsource their backend data centre to companies which have the technical and infrastructure resources to manage them (Xcalibre [4] is one such company). This provides an economy of scale. In particular, the ability to share resources between different customers means that more resources can be made available to handle peak loads on any one particular service. This is the basis of "utility computing" (http://en.wikipedia.org/wiki/Utility_computing). Conventionally, this still involves dedicated hardware – as the load increases, more dedicated machines would be allocated (from a pool of idle machines) and some mechanism used to "load-balance" between them.

Whilst this approach is a scalable solution, it suffers from a number of problems. There is still considerable inefficiency in allocating dedicated servers – at any one time, a large percentage of the machines may be running at a very low load average. It can also take a significant time to load and reconfigure additional servers, which means that there is quite a high latency in responding to requests for increased resources.

The use of "virtual machines" is becoming a popular solution to this problem. Several virtual machines can co-exist on the same physical hardware if their resource requirements are sufficiently low. If the requirements of a VM increase, then it can be "migrated" to a different physical machine which has less contention for the resource (of course, it would also be possible to migrate one of the other VMs to make more resources available on the existing physical machine). This has clear cost benefits, as well as an ability to react more quickly to changing demands.

Phase I of the FlexiScale project provides a data centre architecture based on migrating virtual machines. This includes the virtualisation platform, shared storage, and a control infrastructure. These provide fault-tolerant virtual machines which the customer can manipulate manually, or via a well-defined API.

Phase II is intended to add autonomic capability to the production system by monitoring QoS levels and reconfiguring the system automatically to maintain the required service level.

II. UNDERLYING TECHNOLOGIES

FlexiScale is built on top of existing virtualisation and storage technologies:

A. Hardware virtualisation

Virtual machines access the physical hardware through a layer known as the hypervisor. There are two types of hypervisor:
• A Type 1 (or native, or bare-metal) hypervisor runs directly on the hardware platform. The guest operating system runs on top of the hypervisor. XEN [5] is a type one hypervisor.
• A Type 2 (or hosted) hypervisor runs within an operating system environment – i.e. the guest operating system runs two levels above the hardware. VMware Server [3] (formerly known as GSX) is an example of a type two hypervisor.

Currently, FlexiScale uses a XEN-based hypervisor (supplied by VI: http://www.virtualiron.com/products/open_source.cfm) to provide hardware virtualisation on Intel VT or AMD-V processors. This is a type one hypervisor, chosen to provide maximum performance.
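As a minimal illustration (not FlexiScale code), the following sketch checks whether a Linux host advertises the CPU flags ("vmx" for Intel VT, "svm" for AMD-V) that indicate hardware virtualisation support. It assumes a Linux host with a readable /proc/cpuinfo:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative check only: Linux exposes hardware virtualisation support
// as CPU feature flags, "vmx" for Intel VT and "svm" for AMD-V.
public class VtCheck {
    public static void main(String[] args) throws IOException {
        String cpuinfo = new String(Files.readAllBytes(Paths.get("/proc/cpuinfo")));
        boolean intelVt = cpuinfo.contains(" vmx");
        boolean amdV = cpuinfo.contains(" svm");
        if (intelVt || amdV) {
            System.out.println("Hardware virtualisation available: "
                    + (intelVt ? "Intel VT" : "AMD-V"));
        } else {
            System.out.println("No VT/AMD-V support detected");
        }
    }
}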

B. Migration

FlexiScale relies on the ability to perform live migration of virtual servers between physical machines (without stopping and restarting the servers).

The FlexiScale architecture is modular and can accommodate different implementations of this functionality. Currently, we use Virtual Iron (VI) [2]. This is built on top of the XEN hypervisor and works as an external management layer for the virtual servers. A "management station" supports creation, removal, migration, and starting/stopping of virtual servers. Crucially, VI also supplies a Java API which allows the data centre to be managed programmatically. This API provides access to all the functions of the management station, including live migration.

C. Storage

FlexiScale relies on a centralised storage back-end to maintain all persistent data during VM migration. This allows failed nodes to be instantly rebooted on some other physical hardware.

All stored data currently comes from a centralised SAN back-end which stores both operating system boot images and customer data. We currently use a NetApp FAS3050, which is a hybrid SAN/NAS device. This has a maximum storage capacity of 168TB spread over 336 drives. We use an active-active configuration with two heads that fail over instantly in case of a hardware fault.

This forms a "single point of failure" and needs to be extremely robust:
• The disk shelves have a passive back-plane, redundant power supplies and fans.
• The shelves are dual-connected to the heads via FC connectors.
• The disk shelves run in RAID-DP (or RAID 6) with a spare disk per shelf for fast rebuilds.
• The heads run in active-active mode with an Infiniband interconnect that allows instant failover should one head fail.
• The system is actively monitored by NetApp with a 4hr delivery of spares.
• Each NetApp head is connected to two switches, which allows for failure in the cabling or switching architecture.

III. FlexiScale PHASE I

The objective of FlexiScale phase I has been to build a solid infrastructure. This provides customers with fault-tolerant virtual servers which they can manipulate manually. It is also intended to provide a solid basis for the development of phase II, which will support the autonomic migration. Phase I has been in production use since October 2007 and now supports hundreds of virtual servers.

The key component of phase I is the management station. This unifies the underlying technologies and forms the interface between the user and the infrastructure. The control panel allows the user to stop, start and reboot virtual servers, as well as reconfigure the server specifications (memory etc.). It also manages the networking and storage allocation, including VLAN configurations, DHCP access, firewalling, and disk creation in the NetApp SAN.

When a user starts their virtual server, the management station performs a number of steps (a hypothetical sketch of this sequence is given below):
1) Find a processing node with the appropriate RAM capacity.
2) Load the appropriate boot image from the SAN.
3) Create a virtual server in the physical node, allocating the appropriate RAM.
4) Mount the appropriate disk images from the SAN.
5) Start the virtual server with the boot image.

The interface to the management station is available via a SOAP API, as well as an interactive web page – see figures 1 and 2.

Fig. 1. Customer Control Panel (1)
Fig. 2. Customer Control Panel (2)

The management station also provides fault-tolerance. Failure of a physical node is detected by monitoring a heartbeat, and the management station will redistribute the virtual servers from the failed node among other physical nodes. Since the data is shared via the SAN, the failure time of a service is limited to the boot time of the new instance.

The load on the physical servers is monitored and this can be balanced by migrating virtual servers between physical nodes. Currently (phase I) this is a manual process.
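To make the five-step start sequence concrete, the following sketch expresses it against a hypothetical management interface. The NodePool, San, Node and VirtualServer types are illustrative stand-ins defined for the example; they are not the actual Virtual Iron Java API.

// Hypothetical sketch, not the real Virtual Iron API: the interfaces below
// are stand-ins that make the five-step start sequence concrete.
interface BootImage {}
interface Disk {}
interface VirtualServer {
    void attachDisk(Disk d);
    void boot(BootImage image);
}
interface Node { VirtualServer createVirtualServer(int ramMb); }
interface NodePool { Node findNodeWithFreeRam(int ramMb); }
interface San {
    BootImage loadBootImage(String imageId);
    Disk mountLun(String lunId);
}

public class StartSequence {
    /** Runs the five steps the management station performs on "start". */
    public VirtualServer start(NodePool pool, San san,
                               int ramMb, String imageId, String[] lunIds) {
        Node node = pool.findNodeWithFreeRam(ramMb);         // step 1
        BootImage image = san.loadBootImage(imageId);        // step 2
        VirtualServer vs = node.createVirtualServer(ramMb);  // step 3
        for (String lun : lunIds) {                          // step 4
            vs.attachDisk(san.mountLun(lun));
        }
        vs.boot(image);                                      // step 5
        return vs;
    }
}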

A. Some difficulties

We faced a number of practical difficulties during the development of the first phase:

1) LUN Limits: The ability to migrate any virtual machine to any physical machine is crucial. This means that every physical server needs to be capable of attaching every virtual disk (LUN) in the entire cluster. However, the kernel has some limitations in the iSCSI implementation which mean that this would not be possible at the anticipated scale. We were eventually forced to write an iSCSI LUN management layer that circumvented these limitations and enabled us to scale up to the originally anticipated cluster size.

2) Reboot times: In the case of a total cluster reboot, each physical node needs to scan every LUN before it can boot the virtual machines (this is a consequence of the decision that every physical node should be able to see every LUN). Unfortunately, the VI LUN scan was initially sequential and it could take up to four hours to boot the cluster. We worked with VI to change the LUN scanning process and scanning times are now in the order of seconds (in general, VI have been very responsive in resolving issues with their software). A sketch of such a parallel scan appears after this list.

3) Management database archiving: Another issue with VI was the object-oriented database used to record various events. The management server became unusable when this grew beyond a certain size, and the archiver was unstable with databases of our production size. Forcing a full database rebuild meant that the control system was unavailable for several hours at a time. VI have subsequently fixed the archiver and we can successfully keep the database below the critical size.

4) iSCSI vs Fibre Channel: For cost reasons, it was decided to use iSCSI (the Open iSCSI implementation) instead of Fibre Channel (FC) to attach the storage to the virtual machines. However, iSCSI has not proven as stable as FC, and the throughput and latency are not as good – although this is improving rapidly. One specific problem has been the Open iSCSI implementation of Multi-Path-IO – this does not handle the head-failover of the NetApp correctly (due to the change in MAC address).
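For illustration only (this is not VI's code), the sketch below shows why parallelising the scan collapses the boot time: issuing the per-LUN probes concurrently makes the total time close to that of the slowest probe, rather than the sum of all of them. The probeLun body is a placeholder for the real per-LUN iSCSI work.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: a sequential scan multiplies per-LUN latency by the
// LUN count; a bounded thread pool overlaps the probes instead.
public class ParallelLunScan {
    public static void scan(List<String> lunIds) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(32);
        for (String lun : lunIds) {
            pool.submit(() -> probeLun(lun)); // probes proceed in parallel
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    private static void probeLun(String lunId) {
        // Placeholder for the real iSCSI probe of a single LUN.
        System.out.println("scanned " + lunId);
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> luns = new java.util.ArrayList<>();
        for (int i = 0; i < 1000; i++) luns.add("lun-" + i);
        scan(luns);
    }
}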
IV. FlexiScale PHASE II

Phase II of the FlexiScale project is intended to provide customers with a guaranteed quality of service (QoS). The response times of services will be monitored, and the allocated resources will be increased or decreased to match the level of demand. Figure 3 shows the overall architecture.

Fig. 3. The FlexiScale Architecture

A. Automatic resource scaling

The ability to scale resources automatically in response to fluctuating demand is the key to providing a guaranteed quality of service. There are two ways of scaling the resources per service:
1) Vertical Scaling – this means varying the resources available to a single virtual server, e.g. extra memory, extra disk space and extra CPU capacity. This can involve migration, either of the server itself, or of other servers using the same physical machine.
2) Horizontal Scaling – this means varying the number of virtual servers allocated to a service. To increase capacity, new instances of the service are created by cloning the virtual machine and deploying the copies on other physical servers. The multiple copies of the service are load-balanced. When the demand reduces, instances are removed from the load balancer and the processing cluster.

FlexiScale is intended to support both of these approaches – there are advantages and disadvantages to each. Horizontal scaling is only appropriate for services which can easily be distributed and managed by allocating additional servers. However, it is well-suited to many web services, and can be performed without disrupting the running service. Vertical scaling is appropriate to most applications, but it is limited by the resources of a single physical machine. Currently, the underlying technologies also require a machine to be taken offline to change the CPU capacity.

The current prototype for Phase II of FlexiScale implements horizontal scaling. Within an 18 minute period of fluctuating usage, the system is capable of scaling up by six new virtual servers and load-balancing them. When the demand reduces, the service is scaled down by removing the extra virtual servers. This prototype was developed to scale a simple web site, where the QoS levels are bound by request latency and the maximum number of connections.
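A minimal sketch of such a control loop is shown below. The thresholds and the QosMonitor/ServicePool interfaces are assumptions made for the example rather than the actual prototype logic, but they capture the decision: clone and add an instance when the latency or connection bounds are exceeded, and retire an instance when the remaining servers could comfortably absorb the load.

// Illustrative scaling loop, not the FlexiScale prototype itself.
interface QosMonitor {
    long latencyMs();   // recent average request latency
    int connections();  // current concurrent connections
}
interface ServicePool {
    int size();
    void cloneAndAddInstance(); // clone VM, deploy, add to load balancer
    void removeInstance();      // drain from load balancer, retire VM
}

public class HorizontalScaler {
    static final long MAX_LATENCY_MS = 500;  // illustrative QoS bound
    static final int MAX_CONNS_PER_VM = 200; // illustrative capacity bound
    static final int MIN_INSTANCES = 1;

    /** Called periodically with fresh QoS samples. */
    public void tick(QosMonitor qos, ServicePool pool) {
        boolean overloaded = qos.latencyMs() > MAX_LATENCY_MS
                || qos.connections() > MAX_CONNS_PER_VM * pool.size();
        boolean idle = qos.latencyMs() < MAX_LATENCY_MS / 2
                && qos.connections() < MAX_CONNS_PER_VM * (pool.size() - 1);
        if (overloaded) {
            pool.cloneAndAddInstance();
        } else if (idle && pool.size() > MIN_INSTANCES) {
            pool.removeInstance();
        }
    }
}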
B. Current work

More work is required to make the prototype suitable for production use. We are currently investigating a number of areas:

1) Performance monitoring: Monitoring the performance of the virtual machines presents some challenges. A black-box approach involves monitoring the virtual machine "from the outside" (the hypervisor), and not relying on any code running in the virtual machine itself. However, this does not always provide sufficient information, and a "grey-box" approach may be necessary – this relies on monitoring code within the virtual machine. Since the contents of the virtual machine are the responsibility of the customer, this can lead to reliability and security problems. On the other hand, it may be useful to allow the customer to report their own performance statistics, which could then be tailored to the characteristics of their application – in the general case, it can be difficult to detect the onset of "real" performance problems without a knowledge of the application. There are also issues with collating information, latency, and sampling frequency. The Sandpiper project [8] has some useful observations on these problems.

2) QoS monitoring: Monitoring the true QoS is also difficult. We intend to do this, where possible, with a completely independent system. This should provide a true end-to-end measurement – for example, a separate machine will make requests for specified web pages and measure response times (see the sketch after this list).

3) Decision making: A fully autonomic system needs to automatically detect the cause of any performance problems and implement a reconfiguration to solve them. We believe that most performance problems will be due to CPU contention (rather than network and storage bandwidth). This means that most of the problem resolution strategies involve the detection of "hotspots" in the cluster, and migration of virtual machines to avoid them. Optimal strategies for this are not obvious (see [6]), and are complicated by other considerations – for example, a machine which is making rapid writes across a large memory may dirty the memory faster than it can be migrated to a new server!

4) Charging: Virtual machines allow a much more efficient use of the physical hardware, creating cost savings which can be passed on to the customer. However, the shared resources create a less predictable environment – the service available to any one user will depend on the activity of other users. Some customers may be interested in a guaranteed minimum response time; others may be interested in the average compute power available over a period. There is no single, obvious charging model.
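As an example of the independent end-to-end measurement described in item 2, the sketch below fetches a specified page and records the wall-clock response time. The URL is a placeholder, and a production probe would sample repeatedly and report a distribution rather than a single figure.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal end-to-end probe: fetch a page from a separate machine and time
// the complete response, including the full body transfer.
public class QosProbe {
    public static long measureMs(String page) throws Exception {
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) new URL(page).openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) { /* drain the full response body */ }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; a real probe would target the monitored service.
        System.out.println("response time: "
                + measureMs("http://example.com/") + " ms");
    }
}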

V. CONCLUSIONS

Virtual machines are revolutionising data centres and service provision. In comparison to dedicated machines, they have the potential to significantly reduce costs, and to increase reliability and flexibility. Applications also run in the same way as they would on dedicated machines, and no new technology or approach is required from the customer.

Commercial services such as Amazon's EC2 [1] take advantage of this to provide flexible compute power, but these services must be scaled manually – this is not appropriate for applications which experience wide fluctuations in their resource requirements. We have demonstrated that it is possible to build a system which scales automatically in response to this demand.

REFERENCES

[1] Amazon elastic compute cloud. Web page: http://www.amazon.com/gp/browse.html?node=201590011.
[2] Virtual iron. Web page: http://www.virtualiron.com/.
[3] VMware. Web page: http://www.vmware.com/.
[4] Xcalibre communications. Web page: http://www.xcalibre.co.uk/index.html.
[5] Xen. Web page: http://www.xen.org/.
[6] K. Begnum, M. Disney, Æ. Frisch, and I. Mevåg. Decision support for virtual machine re-provisioning in production environments. In 21st Large Installation System Administration Conference (LISA '07). Usenix, 2007. Available from: http://www.usenix.org/events/lisa07/tech/full_papers/begnum/begnum.pdf.
[7] W. LeFebvre. CNN.com: Facing a world crisis. ;login:, 27(1):83, February 2002. Available from: http://www.usenix.org/events/lisa01/lisa2001confrpts.pdf.
[8] T. Wood, P. Shenoy, A. Venkataramani, and M. Yousif. Black-box and gray-box strategies for virtual machine migration. In Proceedings of the 4th Usenix Symposium on Networked Systems Design and Implementation. Usenix, April 2007. Available from: http://www.usenix.org/events/nsdi07/tech/full_papers/wood/wood.pdf.