Human-centric Composite Quality Modeling and Assessment for Virtual Desktop Clouds

Yingxiao Xu, Prasad Calyam, David Welling, Saravanan Mohan, Alex Berryman, Rajiv Ramnath
University of Missouri, Ohio Supercomputer Center/OARnet, The Ohio State University, USA; Fudan University, China
[email protected]; [email protected]; {dwelling, smohan, aberryman}@osc.edu; [email protected]

Abstract—There are several motivations (e.g., mobility, cost, security) that are fostering a trend to transition users' traditional desktops to thin-client based virtual desktop clouds (VDCs). Such a trend has led to a rising importance for human-centric performance modeling and assessment within user communities in industry and academia that are increasingly adopting desktop virtualization. In this paper, we present a novel reference architecture and its easily-deployable implementation for modeling and assessment of objective user QoE (Quality of Experience) within VDCs, without the need for expensive and time-consuming subjective testing. The architecture novelty is in our approach to integrate finite state machine representations for user workload generation, and slow-motion benchmarking with deep packet inspection of application task performance affected by network health, i.e., QoS (Quality of Service) variations, to derive a "composite quality" metric model of user QoE. We show how this metric is customizable to a particular user group profile with different application sets, and can be used to: (i) identify dominant performance indicators for troubleshooting bottlenecks, and (ii) effectively obtain both 'absolute' and 'relative' objective user QoE measurements needed for pertinent selection/adaptation of thin-client encoding configurations within VDCs. We validate the effectiveness of our composite quality modeling and assessment methodology using subjective and objective user QoE measurements in a real-world VDC featuring RDP/PCoIP thin-client protocols, and actual users for a virtual classroom lab use case within a federated university system.

This material is based upon work supported by VMware and the National Science Foundation under award numbers CNS-1050225 and CNS-1205658. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of VMware or the National Science Foundation.

I. INTRODUCTION

There are several motivations (e.g., mobility, cost, security) that are fostering a trend to transition users' traditional desktops to thin-client based virtual desktop clouds (VDCs) [1] [2]. With the increase in mobile devices with significant computation power and connections to high-speed wired/wireless networks, thin-client technologies for virtual desktop (VD) access are being integrated into mobile devices. In addition, users are increasingly consuming data-intensive content that involves big data (e.g., scientific data analysis) and multimedia streaming (e.g., IPTV) applications, and thin-clients are needed for these applications that require sophisticated server-side computation platforms (e.g., GPUs). Further, using a thin-client may be
more cost effective than owning a full-blown PC due to lower device maintenance costs, and savings through central management of desktop support in terms of OS, application and security upgrades at the server-side.

Fig. 1. Virtual desktop cloud system

Fig. 1 shows the various system components in a VDC. At the server-side, a hypervisor framework (e.g., ESXi, Xen) is used to create pools of virtual machines that host user VDs with popular applications (e.g., Excel, Internet Explorer, Media Player) as well as advanced applications (e.g., Matlab, Moldflow). Users of a common desktop pool use the same set of applications, but maintain their distinctive and personal datasets. The VDs on the server-side share common physical hardware and attached storage drives. At the client-side, users connect to a server-side unified resource broker via the Internet using various TCP (e.g., RDP) and UDP (e.g., PCoIP) based thin-client devices. The unified resource broker handles all the connection requests through authentication of users by Active Directory (or other directory service) lookups, and allows authorized users to access their entitled VDs with appropriate resource allocation amongst distributed data centers.

To allocate and manage VDC resources for large-scale user workloads of desktop delivery with satisfactory user Quality of Experience (QoE), VDC service providers (CSPs) need to suitably provision and adapt the cloud platform CPU, memory and network resources to deliver satisfactory user-perceived 'interactive response times' (a.k.a. timeliness) as well as 'streaming multimedia quality' (a.k.a. coding efficiency). They need to ensure satisfactory user QoE when user workloads are bursty during "flash crowds" or "boot storms", and also when users access VDs from remote sites with varying end-to-end network path performance. For this, they need tools for VDC capacity planning to avoid system and network resource overprovisioning. For instance, they need frameworks and tools to benchmark VD application resource requirements so that they can provision adequate resources to meet user QoE expectations (e.g., < 500 ms for MS Office application open time). If excess system and network resources are provisioned beyond the adequate amounts, users will not perceive the benefit (e.g., a user will not perceive the difference if application open time is 250 ms instead of 500 ms). Overprovisioning can become expensive even at the scale of tens of users, given that each VD requires substantial resources (e.g., 1 GHz CPU, 2 GB RAM, 2 Mbps end-to-end network bandwidth). Hence, CSPs need frameworks and tools to avoid overprovisioning, and ultimately derive inherent benefits such as reduced data center costs and energy savings.

CSPs also need frameworks and tools to monitor resource allocations in an on-going manner so as to detect and troubleshoot user QoE bottlenecks.
Given the fact that CSPs can to a large extent control the CPU and memory resource allocations and right-provision them on the server-side, they most critically need frameworks and tools to detect and troubleshoot network health related issues on the paths over the Internet between the thin-client and the server-side VD. With pertinent network measurement data analysis enabled through such frameworks and tools, they can perform resource adaptations that involve selecting appropriate thin-client protocol configurations that are resilient to network health degradations, and deliver optimum user QoE.

It is important to note that the on-going monitoring should be done without expensive and time-consuming subjective testing involving actual users of the VDC. From the thin-client perspective, remote display protocols are sensitive to network health, i.e., QoS (Quality of Service) variations, and consume as much end-to-end network bandwidth as is available. They employ different underlying TCP or UDP based protocols that exhibit varying levels of user QoE robustness under degraded network conditions [3]. Also, thin-client protocol configuration (and the resulting VD application performance) is highly dependent on application content characteristics (i.e., text, image, video), and improper configuration can greatly impact user QoE due to lags in screen updates as well as keyboard/mouse control actions [4] - [7].

It is evident from the above CSP needs (i.e., capacity planning, thin-client protocol selection and bottleneck troubleshooting) that the frameworks and tools have to be built based on principles of human-centric performance modeling and assessment that can ensure peak user productivity and high satisfaction levels of users in VDCs.
In this paper, we present a novel reference architecture for modeling and assessment of objective user QoE within VDCs, without the need for expensive and time-consuming subjective testing. The architecture involves offline benchmarking of VD application task performance (e.g., time taken to open the Excel application; time for a video file playback) under ideal network health conditions, and modeling the degradation in performance for different thin-client configurations under a broad sample of deteriorating network health conditions.

As part of the offline benchmarking methodology, we leverage finite state machine representations for user workload task state characterization coupled with slow-motion benchmarking [4] for deep packet inspection of VD application task performance affected by QoS variations. To define VD application task states and identify them within network traces during deep packet inspection, we use a concept of 'marker packets' that are instrumented within the traffic between the thin-client and server-side VD ends. We describe how our framework implementation in the form of a "VDBench benchmarking engine" is easily-deployable within existing VD hypervisor environments (e.g., ESXi, Hyper-V, Xen) and can be used to instrument a wide variety of existing Windows and Linux platform based thin-clients (e.g., embedded Windows 7, Windows/Linux VNC, Linux Thinstation, Linux Rdesktop) [8] [9] for monitoring VD user QoE through joint performance analysis of system, network and application context.

By using our novel offline benchmarking methodology within a closed-network testbed, we derive a novel "composite quality" metric model of user QoE and show how it is customizable to particular user group profiles with different application sets, and can be used during online monitoring to: (i) identify dominant performance indicators for troubleshooting bottlenecks amongst numerous factors affecting user QoE in VD application tasks, and (ii) effectively obtain both absolute and relative objective user QoE measurements needed for pertinent selection/adaptation of thin-client encoding configurations. The absolute objective user QoE measurements allow comparison of a thin-client protocol's performance (e.g., RDP) to that of another thin-client protocol (e.g., PCoIP) under a particular QoS condition characterized by latency and packet loss in the path between the thin-client and the server-side. Whereas, the relative objective user QoE measurements allow comparison of a thin-client protocol's performance for degraded QoS conditions with reference to ideal QoS conditions (e.g., low latency and zero loss).

We validate the effectiveness of our composite quality modeling and assessment methodology using subjective and objective user QoE measurements in a real-world VDC featuring RDP/PCoIP thin-client protocols, and actual users (i.e., faculty and students) for a virtual classroom lab use case within a federated university system. We demonstrate how the high level of correlation between the subjective and objective user QoE results from this real-world VDC testing allowed us to determine the more suitable thin-client protocol for the use case. We also show how it helped in the assessment that the VDC infrastructure configuration adopted for the use case had no inherent usability bottlenecks and was capable of effectively delivering satisfactory user QoE.

The remainder of the paper is organized as follows: Section II describes related work. In Section III, we present the reference architecture and its component functions and interactions, especially the user workload generation and slow-motion benchmarking approaches. Section IV presents our human-centric composite quality metric formulation through closed-network testbed experiments. Section V validates the composite quality metric use in a real-world VDC deployment. Section VI concludes the paper.

II. RELATED WORK

There are several works that outline the general architecture considerations and requirements (e.g., isolation, scalability, dynamism, privacy) of an end-to-end system and network monitoring framework in cloud environments [10] - [13]. In most cases, they propose re-use of traditional tools (e.g., active measurement tools such as Ping; passive measurement tools such as Wireshark) and server-side methods for monitoring QoS related factors. For example, the authors in [11] instrument only the server-side virtual machine with measurement collection scripts.

In contrast, our emphasis is human-centric quality assessment, and our method is to instrument both the thin-clients and server-side VDs with measurement collection scripts. These scripts feed measurements taken from traditional active and passive measurement tools into a benchmarking engine to correlate QoS and QoE measurements and build a corresponding historical monitoring data set. Moreover, our method is very similar to the method in [13], where performance bottlenecks are detected during run-time or online, when QoE related metrics exceed known benchmarks that are obtained through offline testing and analysis.
Further, similar to the approach in [2], we use the historical monitoring data set in our framework to adapt resource allocations; specifically, we improve user QoE and avoid overprovisioning of resources.

The importance of human-centric quality assessment frameworks for cloud environments has been highlighted in works such as [14]. They suggest that offline assessment and online monitoring should be based on 'profiles' of user workloads and corresponding user QoE expectations. Our workload generation methodology is based on the realistic emulation of various user group profiles based on the application sets and their corresponding tasks. We adopt a hierarchical state based workload generation method described in [19], where an actual system user's behavior is used to represent the workload characteristics at a high level. This in turn results in a sequence of workload requests at the lower level that can be customized for a particular user group profile and are distinguishable (with marker packets in our case) within network traces.

Coupled with the workload generation, we adopt slow-motion benchmarking for network trace analysis to select suitable thin-client protocol configurations for a particular user group profile. Our motivation to use slow-motion benchmarking for thin-client performance assessment is as follows. In advanced thin-client protocols (e.g., RDP, PCoIP), the server-side does all the compression and sends "screen scraping with multimedia redirection", where a separate channel is opened between the thin-client and server-side VD to send multimedia content in its native format, which is then rendered in the appropriate screen portion at the thin-client. As a result, traditional packet capture techniques using TCPdump (or Wireshark) cannot directly allow VD application task performance measurements.
To overcome this challenge, we adopt and significantly extend the "slow-motion benchmarking" technique developed originally in [4] and [5] for legacy thin-client protocols such as Sun Ray. The technique introduces artificial delay in between screen events of tasks being benchmarked, which allows isolation and full rendering of the visual components of those benchmarks. There are alternate approaches adopted in thin-client performance benchmarking toolkits such as [6] and [7] that focus on recording and playback of keyboard and mouse events on the client-side. However, none of the earlier thin-client benchmarking toolkits consider modeling and mapping server-side events and thin-client objective user QoE for degrading network health conditions in an integrated manner as proposed in our approach.

Earlier studies such as [15] have compared different thin-client protocols using metrics such as bandwidth/memory consumption, especially the popular RDP and PCoIP protocols, to find suitability for different application tasks. Also, we remark that our implementation of the framework builds upon our early work on the VDBench toolkit [3] that can be used for offline benchmarking of VDCs to create user group profiles based on system and network resource consumption characteristics; an online monitoring perspective of the VDBench toolkit was not considered. In comparison to these works, this paper's contributions are more comprehensive in terms of the framework architecture, implementation, as well as comparison and selection of thin-client protocols within VDCs based on offline benchmarking and online monitoring. Further, our work is novel with regards to our aim to derive a novel composite quality metric function that maps QoS and QoE metrics relating to any of the thin-client protocols, especially the dominant metrics for various profiles of VD application sets of user groups.

Other works such as [16] and [17] have used neural networks and regression techniques, respectively, to establish the relationship between QoS and QoE for particular application contexts within multimedia content delivery. In our work, we adopt curve-fitting techniques to obtain absolute and relative objective user QoE for comparing the thin-client protocols over a broad sample of degraded QoS conditions under different interactive operations, and validate their accuracy by correlation with subjective user QoE scores. Note that the authors in [18] emphasize the need for new metrics of user QoE quantification, as they recognize that, to date, tracking user QoE related metrics is not precise, especially at cloud-scale.

III. REFERENCE ARCHITECTURE

In this section, we first describe the conceptual workflow of our offline/online objective user QoE modeling and assessment steps for VDCs. Following this, we describe in detail the user workload generation and slow-motion benchmarking approaches we have adopted within these steps.

A. Conceptual Workflow

Fig. 2. Conceptual workflow for virtual desktop user QoE modeling and assessment

Fig. 2 shows the two major steps for our virtual desktop user QoE modeling and assessment: (i) offline benchmarking and modeling, and (ii) online monitoring. In the offline step, we collect a set of benchmarks through user workload generation followed by slow-motion benchmarking under different QoS emulation conditions under controlled testing configurations of a VDC. The QoS emulation involves varying the delay and loss levels between the thin-client and the server-side VD over a broad sample range. The benchmarks are collected by averaging results from multiple experiment runs in the form of objective QoE metrics that correspond to VD application tasks (e.g., time taken to open the Excel application, or time for a video file playback, or quality of a video file playback).

The benchmark data collected for a particular user group profile with a certain application set allows a CSP to understand the VD application performance under ideal conditions. It also provides a model for mapping how the performance may degrade for different thin-client configurations under deteriorating network health conditions measured through common active or passive measurement tools. By analyzing the performance under degraded conditions, those metrics that are most sensitive to network health fluctuations can be identified as the "dominant metrics". The dominant metrics are key performance indicators in the case of online monitoring. They can be relied upon for validating the fact that the VDC is functioning well, and also in some cases for diagnosing the cause of unsatisfactory user QoE feedback provided by actual users during productive use of the VDC. The subjective user QoE is captured using the popular Mean Opinion Score (MOS) ranking scale of 1-5 [20]; the [1, 3) range being Poor grade, the [3, 4) range being Acceptable grade and the [4, 5] range being Good grade of subjective user satisfaction. Hence, the models can be obtained as closed-form expressions for objective QoE metrics that are composite quality functions of dominant metrics, where higher weights are assigned for relatively more dominant metrics. These objective QoE metrics in turn can be used to correlate performance with actual user QoE MOS rankings.

B. User Workload Generation

Fig. 3. User workload generation finite state machine example

Fig. 3 shows an example finite state machine used in the user workload generation that has various VD application states corresponding to actions performed by a user in relation to 'productivity start' (i.e., application open task), 'productivity progress' (i.e., application functions usage tasks) and 'productivity end' (i.e., application close task). The states of a VD application can be scripted using Windows GUI frameworks such as AutoIT [21], and these scripts can be launched on VDs within the VDC. Measuring the progress of these states on the server-side by recording and analyzing timestamps for different actions can be effective in understanding the interaction response times of VD applications perceived at the thin-client side. While doing so, controllable delays due to user behaviors such as 'think time' can be introduced for achieving greater realism in user workload generation.

Workload templates can be set up by combining a series of such individual finite state machines as part of a parent state machine for orchestrating different VD applications such as Microsoft Excel, Internet Explorer and Windows Media Player in random sequences. Depending upon the workload performance measurement needs on the VDs, the state machines can be modified and re-distributed on the VDs with appropriate 'marker packets'. The marker packets needed to identify the different tasks within the network traces are sent from the workload scripts between the server-side VD and the thin-client on ports that are non-standard in protocol communications for subsequent filtering during post-processing. The marker packets contain information that describes the application under test (e.g., Internet Explorer), the activity event (e.g., type of web-page being downloaded), the event's start/stop timestamps, and other metadata (e.g., the emulated network condition configured for the test).
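For illustration, the following is a minimal Python sketch of such a parent/child finite state machine that emits marker packets; it is not the actual VDBench implementation, and the port number, state names and payload fields are hypothetical assumptions.

```python
# Sketch: a workload FSM that walks 'productivity start/progress/end'
# states and emits JSON marker packets on a non-standard UDP port so the
# task boundaries can be filtered out of a Wireshark/TCPdump trace later.
import json, random, socket, time

MARKER_PORT = 47001          # hypothetical non-standard port for markers

def send_marker(sock, peer, app, event, phase, extra=None):
    payload = {"app": app, "event": event, "phase": phase,
               "ts": time.time(), "meta": extra or {}}
    sock.sendto(json.dumps(payload).encode(), (peer, MARKER_PORT))

def run_app_fsm(sock, peer, app, tasks, think_time=(1.0, 3.0)):
    # 'productivity start': application open task
    send_marker(sock, peer, app, "open", "start")
    # ... an AutoIT (or similar) script would drive the real GUI here ...
    send_marker(sock, peer, app, "open", "stop")
    # 'productivity progress': application function usage tasks
    for task in tasks:
        send_marker(sock, peer, app, task, "start")
        # ... perform the task ...
        send_marker(sock, peer, app, task, "stop")
        time.sleep(random.uniform(*think_time))   # emulated user think time
    # 'productivity end': application close task
    send_marker(sock, peer, app, "close", "start")
    send_marker(sock, peer, app, "close", "stop")

def run_parent_fsm(peer, profile):
    # Parent FSM: orchestrate per-application FSMs in a random sequence.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    apps = list(profile.items())
    random.shuffle(apps)
    for app, tasks in apps:
        run_app_fsm(sock, peer, app, tasks)

if __name__ == "__main__":
    run_parent_fsm("192.0.2.10", {
        "Excel": ["render_text"],
        "InternetExplorer": ["load_low_img", "load_high_img", "load_text"],
        "MediaPlayer": ["play_video"],
    })
```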
C. Slow-motion Benchmarking

Figs. 4 and 5 show the slow-motion benchmarking packet traces with marker packets (represented as red dots) for PCoIP and RDP, respectively, as viewed in Wireshark for a Low resolution image page load within Internet Explorer. We can observe through deep packet inspection that artificial delays are introduced in between screen events of VD application tasks, mainly to ensure that the tasks' visual components are isolated and fully rendered on the thin-client.
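A minimal sketch of this delay-insertion idea (after [4], [5]) follows; the event names and the fixed one-second gap are illustrative assumptions, not the authors' implementation.

```python
# Sketch: pause between the screen events of a scripted task so each
# visual update is isolated and fully rendered at the thin-client before
# the next one begins, as in slow-motion benchmarking.
import time

SLOW_MOTION_GAP_S = 1.0   # artificial delay between screen events

def run_slow_motion(screen_events):
    """screen_events: list of (name, callable) pairs driving one VD task."""
    timestamps = {}
    for name, fire_event in screen_events:
        start = time.time()
        fire_event()                     # e.g., AutoIT-driven GUI action
        timestamps[name] = (start, time.time())
        time.sleep(SLOW_MOTION_GAP_S)    # let rendering fully complete
    return timestamps
```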

(d) delay(ms), loss(%) = 0, 3 (e) delay(ms), loss(%) = 50, 3 (f) delay(ms), loss(%) = 200, 3 Fig. 4. PCoIP slow-motion benchmark traces for Low resolution image page load within Internet Explorer

(a) delay(ms), loss(%) = 0, 0 (b) delay(ms), loss(%) = 50, 0 (c) delay(ms), loss(%) = 200, 0

Fig. 5. RDP slow-motion benchmark traces for Low resolution image page load within Internet Explorer; panels (a)-(f) show delay (ms), loss (%) = (0, 0), (50, 0), (200, 0), (0, 3), (50, 3), (200, 3)

The network packet traces are analyzed both at the client and server sides to measure the VD task performance for ideal and degraded network health conditions. In the figures, we visually show, through the packet counts along the y-axis over task time along the x-axis, how the VD task performance degrades at the thin-client side across a broad sample of network health conditions.

For the systematic emulation of network health conditions, we use combinations of 0 ms, 50 ms, and 200 ms delay and 0% and 3% loss introduced by the Netem network emulator [22] between the client and server sides. These network health combinations are typical of 'well managed', 'partially-well managed' and 'poorly managed' network path characteristics on the Internet, and hence have been used hereafter in this paper as a representative sample space of network health conditions to describe our proposed composite quality modeling approach. They are also sufficient, as seen in the figures, to expose the best case and worst case performance of the VD application tasks. More detailed samples of network health can be considered to obtain more fine-grained composite quality models; however, such data collection and modeling is beyond the scope of this paper.

The performance differences are apparent in the crispness of the network utilization patterns, and in the higher bandwidth consumption (indicated by the packet count values along the y-axis) and lower task times under ideal conditions, compared to the lower bandwidth consumption and higher task times for degraded conditions. In addition, the figures illustrate visually how the popular PCoIP and RDP thin-client protocols handle the degraded conditions for a particular application task (Low resolution image page load within Internet Explorer) and apply protocol-specific compensations. The compensations manifest within the transmission of the visual components of the VD application to the client side and affect remote user consumption and productivity, and ultimately the user QoE.

We can see that the PCoIP protocol consumes relatively less network bandwidth for delivering the same satisfactory user QoE as the RDP protocol under ideal conditions. Further, we can see that PCoIP has a tight render time even in worst cases (i.e., the delay = 200 ms and loss = 3% condition in our sample space) in comparison to RDP. Any degraded network conditions worse than the worst case values we chose for delay and loss deliver only poor user QoE, as evident in the slow-motion benchmarking traces, and hence higher sample values were not considered in our setup. This difference is due to the fact that PCoIP uses UDP as the underlying transmission protocol, whereas RDP is TCP-based and performs retransmissions under lossy network conditions.
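As an aside, the sample space above can be driven with Linux tc/netem; the following sketch is an illustrative assumption (the interface name is hypothetical), not a command sequence taken from the paper.

```python
# Sketch: sweep the emulated network health conditions (0/50/200 ms delay
# crossed with 0/3% loss) using the standard tc/netem qdisc.
import itertools, subprocess

IFACE = "eth0"  # hypothetical interface between thin-client and VD sides

def set_netem(delay_ms, loss_pct):
    subprocess.run(["tc", "qdisc", "replace", "dev", IFACE, "root", "netem",
                    "delay", f"{delay_ms}ms", "loss", f"{loss_pct}%"],
                   check=True)

def clear_netem():
    subprocess.run(["tc", "qdisc", "del", "dev", IFACE, "root"], check=False)

if __name__ == "__main__":
    for delay, loss in itertools.product([0, 50, 200], [0, 3]):
        set_netem(delay, loss)
        # ... run workload generation + slow-motion benchmarking here ...
    clear_netem()
```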
IV. COMPOSITE QUALITY FORMULATION

In this section, we describe the closed-network setup and experimentation conducted to formulate the composite quality functions for the PCoIP and RDP thin-client protocols.

A. Closed-network Testbed

Fig. 6 shows our actual implementation of the reference architecture described in Section III in the form of a "VDBench benchmarking engine" within a VDC at VMLab [23]. The physical components setup and interactions are organized in 3 layers: thin-client sites, middleware services, and server-side virtual desktops. At the thin-client sites, the testing environment with the closed network is used for our offline experiments to formulate the composite quality functions. The validation experiments described later in Section V-A use the same infrastructure for the middleware services and server-side; however, they connect via the Internet from geographically distributed locations. We set up two virtual desktop environments, one with the open-source Apache VCL [24] and the other with VMware VDI [25], whose default thin-client protocols for VD access are RDP and PCoIP, respectively.

The VDBench benchmarking engine automates and orchestrates: (i) the initial workflow steps 1 to 5 shown in Fig. 6, (ii) the subsequent thin-client session setup with either RDP or PCoIP, and (iii) the final objective user QoE measurements obtained via integrated user workload generation and slow-motion benchmarking for various network health conditions. Our implementation is easily-deployable within any existing VD hypervisor environments (e.g., ESXi, Hyper-V, Xen) and relies on their APIs to authenticate and authorize VD requests, allocate resources and reserve them. Utilities such as psexec are used to remotely invoke workload generation scripts, and hence our implementation can also be used to instrument existing Windows and Linux platform based thin-clients (e.g., embedded Windows 7, Windows/Linux VNC, Linux Thinstation, Linux Rdesktop) [8] [9]. Moreover, our implementation allows for storing of performance measurements, and for joint performance analysis of system, network and application context to measure and monitor VD user QoE online.
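For illustration, remote invocation of a workload script with psexec could look like the following sketch; the host name, credentials and script path are hypothetical placeholders, not values from our deployment.

```python
# Sketch: remotely launch a workload generation script on a Windows VD
# using psexec (\\host -u user -p password <command>).
import subprocess

def launch_workload(vd_host, user, password,
                    script=r"C:\vdbench\workload.exe"):
    return subprocess.run(
        ["psexec", rf"\\{vd_host}", "-u", user, "-p", password, script],
        capture_output=True, text=True, check=True)

# launch_workload("vd-pool-01", "DOMAIN\\bench", "secret")
```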
Fig. 6. User QoE modeling and assessment framework: physical components setup and interactions

B. Closed-form Expressions

Table I shows the various objective QoE metrics collected by the VDBench benchmarking engine. Although we collected and analyzed over 20 metrics, we chose only the 7 metrics (M1 - M7) shown in Table I that were observed to have significant trends in terms of user QoE impact as network health conditions degraded, for both the RDP and PCoIP protocols that we considered to test and compare. In other terms, we ignored the metrics of tasks whose packet count variations over task times were not significant in the best case and worst case network health conditions, and hence are not helpful in identifying QoE bottleneck scenarios in VD applications.

M1 - M6 are obtained by subtracting the timestamps between the marker packets for an application task at the thin-client side, and comparing with the ideal QoS condition values. However, M7 is calculated differently: a video is first played back at 1 frame per second (fps) in an atomic manner and network trace statistics are captured. The video is then replayed at full speed a number of times in an aggregate manner through the thin-client protocol under test, and over various network health conditions. A challenge in performance comparisons involving UDP and TCP based thin-client protocols in terms of video quality is to derive a normalized metric. The normalized metric should account for fast completion times with image impairments in UDP based thin-client protocols, in comparison to long completion times in TCP based thin-clients with no impairments but long frame freezes. Towards meeting this challenge, we use the video quality metric shown in Eqn. (1) that was originally developed in [5]:

Q_{play}^{video} = \dfrac{\dfrac{\text{Data Transferred (aggregate fps)}}{\text{Render Time (aggregate fps)} \cdot \text{Ideal Transfer (aggregate fps)}}}{\dfrac{\text{Data Transferred (atomic fps)}}{\text{Render Time (atomic fps)} \cdot \text{Ideal Transfer (atomic fps)}}} \quad (1)

This M7 metric relates the slow-motion playback in an atomic manner to the full speed playback in an aggregate manner to see how many frames were dropped, merged, or otherwise not transmitted.
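A minimal sketch of how M1-M6 and the Eqn. (1) metric M7 could be computed is shown below; the dictionary layout is an assumption for illustration, not the VDBench data model.

```python
# Sketch: M1-M6 from marker-packet timestamps in a filtered trace, and
# M7 per Eqn. (1) from atomic (1 fps) vs. aggregate (full speed) playback.
def task_time(markers, app, event):
    """M1-M6: stop minus start timestamp of one application task."""
    return markers[(app, event, "stop")] - markers[(app, event, "start")]

def video_quality(agg, atomic):
    """M7 per Eqn. (1). agg/atomic: dicts with data transferred (bytes),
    render time (s) and ideal transfer (bytes) for each playback mode."""
    def normalized_rate(s):
        return (s["data"] / s["render_time"]) / s["ideal"]
    return normalized_rate(agg) / normalized_rate(atomic)

# Example: m3 = task_time(markers, "InternetExplorer", "load_low_img")
```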

TABLE I
NOTATIONS

Notation | Definition
T_open^excel (a.k.a. M1) | "Excel Open Time" (s) is the time taken for the Excel application to open
T_render^excel (a.k.a. M2) | "Excel Render Time" (s) is the time taken for the Excel application to render sample text
T_load^low img (a.k.a. M3) | "Low Resolution Image Load Time" (s) is the time taken for Internet Explorer to load a web-page with a sample low resolution image of the US Constitution
T_load^high img (a.k.a. M4) | "High Resolution Image Load Time" (s) is the time taken for Internet Explorer to load a web-page with a sample high resolution image of the US Constitution
T_load^text (a.k.a. M5) | "Text Load Time" (s) is the time taken for Internet Explorer to load a web-page with sample text
T_play^video (a.k.a. M6) | "Media Playback Time" (s) is the time taken for Windows Media Player to playback a sample video
Q_play^video (a.k.a. M7) | "Media Playback Quality" is the playback quality of the sample video in Windows Media Player
CQS | "Composite Quality Score" calculated as shown by the composite quality equation (Eqn. (2))
RCQS | "Relative Composite Quality Score" is the normalized CQS obtained from the relative calculation in Eqn. (3)
ACQS | "Absolute Composite Quality Score" is the normalized CQS obtained from the absolute calculation in Eqn. (4)
AMOS | "Absolute Mean Opinion Score" is the average score given by real users during subjective QoE testing

Amongst these 7 metrics, the metrics that were most sensitive to change in network health conditions represent the more dominant metrics that impact user QoE. For other application sets and user expectations of VD application performance, the number of metrics could vary. In any case, the composite quality function to predict user QoE given values of n metrics (M1, M2, ..., Mn) can be calculated using the general form

CQS = W_1 M_1 + W_2 M_2 + \cdots + W_n M_n

where each metric M_i, i \in \{1, \ldots, n\}, has a corresponding weight W_i whose value is determined by how sensitive the metric is to network health degradations, or how much of a key performance indicator that metric represents. We assign a weight value by calculating the change in the measurement value of the metric per unit change in QoS, i.e., change in delay and loss. Table II shows the normalized weights of the dominant metrics derived for the PCoIP and RDP protocols from the packet traces in our closed-network testbed experiments involving the systematic emulation of network health conditions described in Section III-C. Combining these weights and metrics, we derived the closed-form expression shown in Eqn. (2) that can be used to predict the composite quality function for our setup:

CQS = W_7 \sqrt{M_7} - \left( W_1 M_1 + W_2 M_2 + W_3 M_3 + W_4 M_4 + W_5 M_5 + W_6 \frac{M_6}{5} \right) \quad (2)

The above function was derived in a trial-and-error fashion with increasing complexity in relation to Eqn. (1) to obtain closed-form expressions that best fit the training data curve. Since the dominant metrics in our case were the same for both RDP and PCoIP, the same equation is applicable to estimate CQS for both RDP and PCoIP. Note that M7 was found to be a very highly dominant metric compared to the rest of the metrics; hence we took the square root value for this metric to balance its influence in relation to the other M1-M6 metrics.

TABLE II
NORMALIZED WEIGHTS SHOWING DOMINANT METRICS

Metric | PCoIP Weight | RDP Weight
T_open^excel | 0.06 | 0.08
T_render^excel | 0.01 | 0.03
T_load^low img | 0.26 | 0.13
T_load^high img | 0.22 | 0.14
T_load^text | 0.07 | 0.08
T_play^video | 0.09 | 0.30
Q_play^video | 0.26 | 0.21
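For illustration, the weight normalization and Eqn. (2) evaluation could be implemented as in the sketch below (the PCoIP weights are copied from Table II; note the published weights are rounded, so recomputed CQS values will only approximate the tables).

```python
# Sketch: derive normalized weights from per-metric sensitivity to QoS
# change, and evaluate the Eqn. (2) composite quality score.
import math

def normalized_weights(sensitivity):
    """sensitivity: metric -> |change in metric value per unit QoS change|."""
    total = sum(sensitivity.values())
    return {m: s / total for m, s in sensitivity.items()}

def cqs(m, w):
    """Eqn. (2): m (measurements) and w (weights) are dicts keyed M1..M7."""
    return (w["M7"] * math.sqrt(m["M7"])
            - (w["M1"] * m["M1"] + w["M2"] * m["M2"] + w["M3"] * m["M3"]
               + w["M4"] * m["M4"] + w["M5"] * m["M5"]
               + w["M6"] * m["M6"] / 5.0))

PCOIP_W = {"M1": 0.06, "M2": 0.01, "M3": 0.26, "M4": 0.22,
           "M5": 0.07, "M6": 0.09, "M7": 0.26}
# Usage: cqs(measured_metrics, PCOIP_W)
```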

For comparing the thin-client protocols in relative and absolute terms, we define the RCQS and ACQS terms as shown in Eqns. (3) and (4), respectively:

RCQS = \frac{CQS_{curr} - CQS_{200,3}}{CQS_{0,0} - CQS_{200,3}} \times 100 \quad (3)

ACQS = \frac{CQS_{curr} - \min\left(CQS_{200,3}^{RDP},\, CQS_{200,3}^{PCoIP}\right)}{\max\left(CQS_{0,0}^{RDP},\, CQS_{0,0}^{PCoIP}\right) - \min\left(CQS_{200,3}^{RDP},\, CQS_{200,3}^{PCoIP}\right)} \times 100 \quad (4)

RCQS and ACQS are normalized and calculated in the form of percentages (between 0 and 100) using the CQS for the current, ideal and worst network health conditions for the RDP and PCoIP thin-client protocols. For example, in the case of our experiments, the CQS_curr that refers to the composite quality score being calculated for a given network health condition is compared with the CQS values for the ideal-case delay and loss (i.e., 0 ms, 0%) and the worst case (i.e., 200 ms, 3%). This allows us to compare how good the achieved user QoE performance is in relation to the ideal case (i.e., the relative objective QoE score) for a thin-client protocol under test. Moreover, this also allows us to compare the thin-client protocols, which may have different weights applied to the same metrics under degraded conditions (i.e., the absolute objective QoE score). Tables III and IV show the RDP and PCoIP composite quality calculations and the RCQS and ACQS values obtained in our closed-network testbed. We can readily observe that the RDP and PCoIP protocols perform differently for varying application contexts and network health condition degradations.
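A minimal sketch of Eqns. (3) and (4) follows; the dictionary-based interface is an assumption for illustration.

```python
# Sketch: relative (Eqn. 3) and absolute (Eqn. 4) composite quality scores,
# given CQS at the current, ideal (0 ms, 0%) and worst (200 ms, 3%)
# network health conditions.
def rcqs(cqs_curr, cqs_ideal, cqs_worst):
    """Eqn. (3): performance relative to one protocol's own ideal/worst."""
    return (cqs_curr - cqs_worst) / (cqs_ideal - cqs_worst) * 100.0

def acqs(cqs_curr, cqs_ideal_by_proto, cqs_worst_by_proto):
    """Eqn. (4): performance on a scale shared across both protocols."""
    worst = min(cqs_worst_by_proto.values())
    best = max(cqs_ideal_by_proto.values())
    return (cqs_curr - worst) / (best - worst) * 100.0

# e.g., RDP at (50 ms, 0%): rcqs(-3.72, -0.13, -17.61) ~= 79.5 (Table III)
```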
V. VALIDATION RESULTS

In this section, we validate the effectiveness of our composite quality modeling and assessment methodology in a real-world VDC. We first describe how the validation testbed was set up for user trials. Following this, we describe results from our subjective and objective user QoE measurements analysis.

A. VDPilot Testbed

We used the production environment shown in Fig. 6, which is similar to the testing environment (without the network emulator), as a 'VDPilot' testbed for our validation experiments. The testbed features RDP/PCoIP thin-client protocols, and actual users for a virtual classroom lab use case within a federated university system. A total of 36 users registered in the VDPilot, and the users were mainly faculty and students within a diverse selection of Ohio-based universities that included: The Ohio State University, The University of Dayton, The University of Akron, Ohio University, Denison University, Walsh University, Sinclair University, Ashland University and Baldwin Wallace College.

As part of the subjective testing activities within the VDPilot testbed, participants were asked to compare the Apache VCL (configured with default RDP) and VMware VDI (configured with default PCoIP) remote thin-clients while performing tasks in the virtual desktops using applications such as Excel, Windows Media Player, and Internet Explorer. After completing subjective testing, participants were asked to complete an online survey to provide feedback about their perceived user QoE while accessing virtual desktop applications with Apache VCL and VMware VDI.

Subsequently, the participants were asked to conduct objective QoE testing to obtain a more apples-to-apples comparison of user QoE at the different sites, and to eliminate any outlier biases or mood effects of participants that reflect in their subjective opinion of VD application user QoE. As part of the objective QoE testing, participants downloaded, installed and ran the OSC/OARnet VDBench software (client screenshot shown in Fig. 7) for both the Apache VCL and VMware VDI remote thin-clients. The installation prerequisites included the latest Java runtime environment, Wireshark and two clients in the form of JAR files, one for Apache VCL and another for VMware VDI. The VDBench client software implements the client-side portions of our user workload generation and slow-motion benchmarking methodologies explained in Sections III-B and III-C, respectively. It can run on both Windows and Linux platforms, has capabilities for NIC selection for test initiation, and interacts with the VDBench benchmarking engine at the server-side through messages encoded within marker packets. It is also secure for use in VDCs as it requires an authorized participant to input a username and password that is valid within the Active Directory on the server-side.

Fig. 7. Java VDBench Client

The VDBench client executes a series of automated tests over a span of several minutes within the participant's remote thin-client in order to simulate or mimic the actions performed by participants during the subjective testing over the Internet. While performing the tests within an instance of either Apache VCL or VMware VDI, the software records quantitative performance information in terms of interactive application response times and video playback quality metrics, i.e., M1 - M7 shown in Table I, that can be used to identify bottlenecks and to correlate with subjective user QoE rankings, i.e., the MOS rankings of participants. The collected measurements after a test run within an instance of either Apache VCL or VMware VDI are displayed to the user in the VDBench client user interface, and a copy is stored in a database automatically on the server-side.

B. Subjective QoE Measurements Correlation

Table V shows the subjective and objective user QoE results from testing in the VDPilot testbed. The AMOS rankings shown are the average of the user QoE feedback provided by the actual users. We can observe that both the AMOS rankings for RDP and PCoIP were above 4 (i.e., the PCoIP ranking was 4.74 and the RDP ranking was 4.21), which refers to 'Good' grade user QoE. In addition, the ACQS values were high and correlate with our closed-network testbed's baseline results for a network health with 50 ms delay and 0% packet loss, which is expected given the participant locations over a regional-level wide-area network. Thus, we were able to assess that the VDC infrastructure configuration adopted for the VDPilot use case had no inherent usability bottlenecks and was capable of effectively delivering satisfactory user QoE for Ohio-based universities.

We can also observe that the AMOS rankings closely correlate with the objective user QoE measurements given by ACQS, which was calculated using Eqn. (4) for both the RDP and PCoIP protocols.
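For illustration, an online-monitoring style check built on these pieces could look like the sketch below; the grading thresholds follow the MOS scale of Section III-A, while the deviation heuristic is an assumption, not the paper's procedure.

```python
# Sketch: grade a subjective MOS score, and flag the dominant metric with
# the largest weighted deviation from its ideal-condition baseline
# (cf. the weights in Table II and baselines in Tables III-IV).
def grade_mos(mos):
    """MOS grading used for subjective QoE [20]."""
    return "Good" if mos >= 4 else "Acceptable" if mos >= 3 else "Poor"

def flag_bottleneck(measured, baseline, weights):
    """Return the dominant metric with the largest weighted deviation
    from its ideal-condition baseline value."""
    deviations = {m: weights[m] * abs(measured[m] - baseline[m])
                  for m in weights}
    return max(deviations, key=deviations.get)
```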

TABLE III
RDP COMPOSITE QUALITY CALCULATION IN CLOSED-NETWORK TESTBED

Delay (ms) | Loss (%) | T_open^excel | T_render^excel | T_load^low img | T_load^high img | T_load^text | T_play^video | Q_play^video | CQS | RCQS | ACQS
0 | 0 | 1.12 | 20.63 | 0.32 | 0.52 | 0.49 | 7.00 | 7.66 | -0.13 | 100.00 | 89.67
50 | 0 | 0.94 | 20.85 | 0.98 | 1.13 | 0.51 | 50.65 | 1.37 | -3.72 | 79.48 | 71.38
200 | 0 | 1.09 | 20.70 | 0.42 | 0.41 | 0.95 | 220.65 | 0.26 | -14.33 | 18.78 | 17.07
0 | 3 | 1.21 | 20.64 | 0.46 | 1.04 | 0.50 | 6.65 | 6.07 | -0.35 | 98.74 | 88.55
50 | 3 | 1.13 | 20.81 | 1.30 | 1.76 | 0.43 | 62.54 | 0.89 | -4.70 | 73.90 | 66.50
200 | 3 | 1.40 | 20.58 | 0.42 | 0.68 | 0.41 | 273.58 | 0.21 | -17.61 | 0.00 | 0.00

TABLE IV
PCOIP COMPOSITE QUALITY CALCULATION IN CLOSED-NETWORK TESTBED

Delay (ms) | Loss (%) | T_open^excel | T_render^excel | T_load^low img | T_load^high img | T_load^text | T_play^video | Q_play^video | CQS | RCQS | ACQS
0 | 0 | 0.91 | 20.51 | 0.18 | 0.34 | 0.26 | 5.98 | 22.78 | 1.86 | 100.00 | 100.00
50 | 0 | 0.88 | 20.56 | 0.23 | 0.46 | 0.29 | 8.80 | 11.45 | 1.05 | 64.47 | 95.84
200 | 0 | 0.96 | 20.46 | 0.39 | 0.71 | 0.39 | 9.91 | 10.49 | 0.85 | 55.45 | 94.47
0 | 3 | 0.77 | 20.52 | 0.17 | 0.36 | 0.22 | 6.00 | 0.80 | -0.12 | 12.62 | 89.94
50 | 3 | 0.91 | 20.60 | 0.24 | 0.43 | 0.27 | 8.45 | 0.69 | -0.25 | 7.02 | 89.12
200 | 3 | 0.90 | 20.50 | 0.35 | 0.77 | 0.51 | 6.03 | 0.45 | -0.41 | 0.00 | 88.34

TABLE V
SUBJECTIVE AND OBJECTIVE QOE MEASUREMENTS CORRELATION RESULTS

Protocol | T_open^excel | T_render^excel | T_load^low img | T_load^high img | T_load^text | T_play^video | Q_play^video | CQS | ACQS | AMOS
RDP | 7.66 | 23.79 | 0.71 | 1.01 | 1.12 | 17.54 | 5.50 | -1.81 | 81.11 | 4.21
PCoIP | 2.19 | 20.79 | 0.22 | 0.41 | 0.32 | 7.65 | 11.46 | 0.99 | 95.76 | 4.74

Further, we can conclude that the PCoIP thin-client protocol is the more suitable thin-client protocol for the virtual classroom lab use case given its ACQS ranking of 95.76, in comparison to the ACQS ranking of 81.11 for the RDP thin-client protocol. It is relevant to remark that ACQS is the more relevant objective QoE metric than RCQS to compare with AMOS, since we are comparing the performance between two different thin-client protocols.

VI. CONCLUSION

In this paper, we presented a novel human-centric reference architecture and its easily-deployable implementation for modeling and assessment of objective user QoE within VDCs, without the need for expensive and time-consuming subjective testing. The architecture integrated finite state machine representations for user workload generation, and slow-motion benchmarking with deep packet inspection of application task performance affected by network health, i.e., QoS variations, to derive a "composite quality" metric model of user QoE. We showed how this metric is customizable to a particular user group profile with different application sets, and can be used to: (i) identify dominant performance indicators for troubleshooting bottlenecks, and (ii) effectively obtain both 'absolute' and 'relative' objective user QoE measurements needed for pertinent selection/adaptation of thin-client encoding configurations within VDCs.

Our framework and its implementation in the form of a 'VDBench benchmarking engine' on the server-side, and a 'Java-based VDBench client' on the thin-client side, can be adopted by CSPs within existing VD hypervisor environments (e.g., ESXi, Hyper-V, Xen) and can be extended to instrument a wide variety of existing Windows and Linux platform based thin-clients (e.g., embedded Windows 7, Windows/Linux VNC, Linux Thinstation, Linux Rdesktop). Effective adoption of our framework and implementation for monitoring VD user QoE through joint performance analysis of system, network and application context can ultimately enable CSPs to have: (a) satisfied VD users, (b) high VD application productivity levels, and (c) reduced VDC support costs due to increased performance transparency.

We validated the effectiveness of our composite quality modeling and assessment methodology using subjective and objective user QoE measurements in a real-world VDC, viz., VDPilot, featuring RDP/PCoIP thin-client protocols, and actual users for a virtual classroom lab use case within a regional-level federated university system. The high level of correlation between the subjective and objective user QoE results from testing allowed us to determine PCoIP as the more suitable thin-client protocol for the virtual classroom lab use case. We also showed we were able to assess that the VDC infrastructure configuration adopted for the use case had no inherent usability bottlenecks and was capable of effectively delivering satisfactory user QoE over the Internet at a regional level.

REFERENCES
[1] P. Calyam, R. Patali, A. Berryman, A. Lai, R. Ramnath, "Utility-directed Resource Allocation in Virtual Desktop Clouds", Elsevier Computer Networks Journal (COMNET), 2011.
[2] L. Deboosere, B. Vankeirsbilck, P. Simoens, et al., "Cloud-based Desktop Services for Thin Clients", IEEE Internet Computing, 2011.
[3] A. Berryman, P. Calyam, M. Honigford, A. Lai, "VDBench: A Benchmarking Toolkit for Thin-Client Based Virtual Desktop Environments", Proc. of IEEE CloudCom, 2010.
[4] J. Nieh, S. Yang, N. Novik, "Measuring Thin-Client Performance using Slow-motion Benchmarking", ACM Transactions on Computer Systems, Vol. 21, No. 1, Pages 87-115, 2003.
[5] A. Lai, J. Nieh, "On The Performance Of Wide-Area Thin-Client Computing", ACM Transactions on Computer Systems, Vol. 24, No. 2, Pages 175-209, 2006.
[6] J. Rhee, A. Kochut, K. Beaty, "DeskBench: Flexible Virtual Desktop Benchmarking Toolkit", Proc. of Integrated Management (IM), 2009.
[7] N. Zeldovich, R. Chandra, "Interactive Performance Measurement with VNCplay", Proc. of USENIX Annual Technical Conference, 2005.
[8] Thinstation: Open-source Thin-client Operating System - http://www.thinstation.org
[9] Wyse Products - http://www.wyse.com
[10] P. Hasselmeyer, N. d'Heureuse, "Towards Holistic Multi-Tenant Monitoring for Virtual Data Centers", Proc. of IEEE/IFIP NOMS Workshop, 2010.
[11] S. De Chaves, R. Uriarte, C. Westphall, "Toward an Architecture for Monitoring Private Clouds", IEEE Communications Magazine, Vol. 49, No. 12, Pages 130-137, 2011.
[12] S. Clayman, A. Galis, C. Chapman, et al., "Monitoring Service Clouds in the Future Internet", Towards the Future Internet - Emerging Trends from European Research, IOS Press, ISBN 978-1-60750-538-9, 2010.
[13] V. Emeakaroha, M. Netto, et al., "Towards Autonomic Detection of SLA Violations in Cloud Infrastructures", Future Generation Computer Systems, Vol. 28, No. 7, Pages 1017-1029, 2012.
[14] J. Shao, H. Wei, Q. Wang, H. Mei, "A Runtime Model Based Monitoring Approach for Cloud", Proc. of IEEE Conference on Cloud Computing (CLOUD), 2010.
[15] J. Kouril, P. Lambertova, "Performance Analysis and Comparison of Virtualization Protocols, RDP and PCoIP", Proc. of International Conference on Computers (ICCOMP), 2010.
[16] S. Mohamed, G. Rubino, "A Study of Real-Time Packet Video Quality using Random Neural Networks", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 12, Pages 1071-1083, 2002.
[17] M. Fiedler, T. Hossfeld, T. Phuoc, "A Generic Quantitative Relationship between Quality of Experience and Quality of Service", IEEE Network, Vol. 24, No. 2, Pages 36-41, 2010.
[18] L. Spracklen, B. Agrawal, R. Bidarkar, H. Sivaraman, "Comprehensive User Experience Monitoring", VMware Technical Journal, 2012.
[19] H. Hlavacs, G. Kotsis, "Modeling User Behavior: A Layered Approach", Proc. of IEEE MASCOTS, 1999.
[20] M. Pinson, S. Wolf, "A New Standardized Method for Objectively Measuring Video Quality", IEEE Transactions on Broadcasting, Vol. 50, Pages 312-322, 2004.
[21] AutoIT Windows GUI Scripting Framework - http://www.autoitscript.com
[22] The Netem Network Emulator - http://www.linuxfoundation.org/collaborate/workgroups/networking/netem
[23] P. Calyam, A. Berryman, A. Lai, M. Honigford, "VMLab: Infrastructure to Support Desktop Virtualization Experiments for Research and Education", VMware Technical Journal, 2012; http://vmlab.oar.net
[24] M. Vouk, A. Rindos, S. Averitt, et al., "Using VCL Technology to Implement Distributed Reconfigurable Data Centers and Computational Services for Educational Institutions", IBM Journal of Research and Development, Vol. 53, No. 4, Pages 509-526, 2009; https://cwiki.apache.org/VCL/apache-vcl.html
[25] VMware Virtual Desktop Infrastructure and VMware View - http://www.vmware.com